Project subject:
Building different forecasting models to predict one of the variables of the dataset.
The project is coded in R.
The project is based on the CRISP-DM model.
Cross-industry standard process for data mining, known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.
CRISP-DM breaks the process of data mining into six major phases:
The first two phases (business understanding and data understanding) overlap, so we cover them together.
Questions to understand the problem
Answer: This project predicts the number of applications received by various American universities. The output can be used to distinguish universities from one another, and the results may be of interest to the Ministry of Education: popular universities can be identified to receive more funding, or plans can be made to bring other universities up to the level of the top ones. The results can also help students choose a university where they have a good chance of being accepted.
College : U.S. News and World Report’s College Data
Statistics for a large number of US colleges from the 1995 issue of US News and World Report. https://rdrr.io/cran/ISLR/man/College.html
college_data <- read.csv("college.csv")
head(college_data)
## X Private Apps Accept Enroll Top10perc Top25perc
## 1 Abilene Christian University Yes 1660 1232 721 23 52
## 2 Adelphi University Yes 2186 1924 512 16 29
## 3 Adrian College Yes 1428 1097 336 22 50
## 4 Agnes Scott College Yes 417 349 137 60 89
## 5 Alaska Pacific University Yes 193 146 55 16 44
## 6 Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## 1 2885 537 7440 3300 450 2200 70 78
## 2 2683 1227 12280 6450 750 1500 29 30
## 3 1036 99 11250 3750 400 1165 53 66
## 4 510 63 12960 5450 450 875 92 97
## 5 249 869 7560 4120 800 1500 76 72
## 6 678 41 13500 3335 500 675 67 73
## S.F.Ratio perc.alumni Expend Grad.Rate
## 1 18.1 12 7041 60
## 2 12.2 16 10527 56
## 3 12.9 30 8735 54
## 4 7.7 37 19016 59
## 5 11.9 2 10922 15
## 6 9.4 11 9727 55
# Remove the first column
college_data <- college_data[,-1]
dim(college_data)
## [1] 777 18
A data frame with 777 observations on the following 18 variables.
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section’s 1995 Data Analysis Exposition.
class(college_data)
## [1] "data.frame"
str(college_data)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
## $ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
## $ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
## $ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
## $ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
## $ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
## $ Books : int 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
## $ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
## $ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
Categorical variable :
Integer variables :
Real number :
Percentage numbers :
The range of these numbers is between 0 and 100
summary(college_data)
## Private Apps Accept Enroll
## Length:777 Min. : 81 Min. : 72 Min. : 35
## Class :character 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242
## Mode :character Median : 1558 Median : 1110 Median : 434
## Mean : 3002 Mean : 2019 Mean : 780
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902
## Max. :48094 Max. :26330 Max. :6392
## Top10perc Top25perc F.Undergrad P.Undergrad
## Min. : 1.00 Min. : 9.0 Min. : 139 Min. : 1.0
## 1st Qu.:15.00 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0
## Median :23.00 Median : 54.0 Median : 1707 Median : 353.0
## Mean :27.56 Mean : 55.8 Mean : 3700 Mean : 855.3
## 3rd Qu.:35.00 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0
## Max. :96.00 Max. :100.0 Max. :31643 Max. :21836.0
## Outstate Room.Board Books Personal
## Min. : 2340 Min. :1780 Min. : 96.0 Min. : 250
## 1st Qu.: 7320 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850
## Median : 9990 Median :4200 Median : 500.0 Median :1200
## Mean :10441 Mean :4358 Mean : 549.4 Mean :1341
## 3rd Qu.:12925 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700
## Max. :21700 Max. :8124 Max. :2340.0 Max. :6800
## PhD Terminal S.F.Ratio perc.alumni
## Min. : 8.00 Min. : 24.0 Min. : 2.50 Min. : 0.00
## 1st Qu.: 62.00 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00
## Median : 75.00 Median : 82.0 Median :13.60 Median :21.00
## Mean : 72.66 Mean : 79.7 Mean :14.09 Mean :22.74
## 3rd Qu.: 85.00 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00
## Max. :103.00 Max. :100.0 Max. :39.80 Max. :64.00
## Expend Grad.Rate
## Min. : 3186 Min. : 10.00
## 1st Qu.: 6751 1st Qu.: 53.00
## Median : 8377 Median : 65.00
## Mean : 9660 Mean : 65.46
## 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :56233 Max. :118.00
Note: Private must be converted to a categorical (factor) variable, and the PhD and Grad.Rate variables exceed their upper limit of 100.
#convert to factor
college_data$Private <- factor(college_data$Private)
summary(college_data$Private)
## No Yes
## 212 565
library("ggplot2")
ggplot(data = college_data, aes(x = Private , y = stat(prop * 100), group = 1)) +
geom_bar(fill= c("dark blue", "dark red")) +
ylab("percentage") +
scale_y_continuous(breaks = seq(0,100,10),limits = c(0,100)) +
ggtitle("Private university or not")+
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
| private | number | percent |
|---|---|---|
| NO | 212 | 27.3% |
| YES | 565 | 72.7% |
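The counts and percentages in the table above can be reproduced directly from the factor (a minimal base-R sketch):

```r
# Counts and percentages of the Private factor
counts <- table(college_data$Private)
percents <- round(prop.table(counts) * 100, 1)
rbind(number = counts, percent = percents)
```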
Note: The two classes are imbalanced; private universities account for almost three quarters of the observations.
summary(college_data$Apps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 81 776 1558 3002 3624 48094
#Test of normality
#1.Histogram
hist(college_data$Apps, probability = T, breaks = 20,
main = "Apps hist", xlab = "Apps")
lines(density(college_data$Apps), col = "red")
#2.QQplot
qqnorm(college_data$Apps, main = "QQ plot of Apps", pch = 20)
qqline(college_data$Apps, col = "red")
#3.Test Skewness and Kurtosis
library("moments")
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Apps)
##
## Jarque-Bera Normality Test
##
## data: college_data$Apps
## JB = 24687, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Apps)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Apps
## kurt = 29.595, z = 15.195, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
college_data[college_data$Apps > 22000,]
## Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 484 No 48094 26330 4520 36 79 21401 3712
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 484 7410 4748 690 2009 90 95 19.5 19
## Expend Grad.Rate
## 484 10474 77
college_data <- college_data[-which(row.names(college_data)== 484), ]
hist(college_data$Apps, probability = T, breaks = 20,main = "Apps hist", xlab = "Apps", ylim = c(0,0.00025))
lines(density(college_data$Apps), col = "red")
conclusion: reject normality assumption.
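Since the same three-step normality check is repeated below for every numeric variable, it can be wrapped in a small helper function (a sketch; `check_normality` is a name introduced here for illustration, not part of the original analysis):

```r
library("moments")

# Run the histogram + QQ plot + skewness/kurtosis battery for one variable.
check_normality <- function(x, name) {
  hist(x, probability = TRUE, breaks = 20,
       main = paste(name, "hist"), xlab = name)
  lines(density(x), col = "red")
  qqnorm(x, main = paste("QQ plot of", name), pch = 20)
  qqline(x, col = "red")
  list(jarque_bera = jarque.test(x),    # skewness-based normality test
       anscombe    = anscombe.test(x))  # kurtosis-based normality test
}

# Example: check_normality(college_data$Apps, "Apps")
```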
boxplot(college_data$Apps, main = "Apps boxplot")
Note: The Apps variable has a lot of Outliers.
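Statements like "a lot of outliers" can be quantified with the same 1.5 × IQR rule the boxplot uses (a sketch; the exact count depends on the cleaning steps applied so far):

```r
# Number of points outside the boxplot whiskers (1.5 * IQR rule)
length(boxplot.stats(college_data$Apps)$out)
```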
summary(college_data$Accept)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 72.0 603.2 1109.5 1987.5 2407.5 18744.0
#Test of normality
#1.Histogram
hist(college_data$Accept, probability = T, breaks = 15,
main = "Accept hist", xlab = "Accept")
lines(density(college_data$Accept), col = "red")
#2.QQplot
qqnorm(college_data$Accept, main = "QQ plot of Accept", pch = 20)
qqline(college_data$Accept, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Accept)
##
## Jarque-Bera Normality Test
##
## data: college_data$Accept
## JB = 3759.8, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Accept)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Accept
## kurt = 12.359, z = 11.912, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Accept, main = "Accept boxplot")
Note: The Accept variable has a lot of Outliers.
summary(college_data$Enroll)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.0 242.0 434.0 775.2 893.8 6392.0
#Test of normality
#1.Histogram
hist(college_data$Enroll, probability = T, breaks = 15,
main = "Enroll hist", xlab = "Enroll")
lines(density(college_data$Enroll), col = "red")
#2.QQplot
qqnorm(college_data$Enroll, main = "QQ plot of Enroll", pch = 20)
qqline(college_data$Enroll, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Enroll)
##
## Jarque-Bera Normality Test
##
## data: college_data$Enroll
## JB = 3539.4, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Enroll)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Enroll
## kurt = 11.963, z = 11.761, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Enroll, main = "Enroll boxplot")
Note: The Enroll variable has a lot of Outliers.
summary(college_data$Top10perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 23.00 27.55 35.00 96.00
#Test of normality
#1.Histogram
hist(college_data$Top10perc, probability = T, breaks = 15,
main = "Top10perc hist", xlab = "Top10perc")
lines(density(college_data$Top10perc), col = "red")
#2.QQplot
qqnorm(college_data$Top10perc, main = "QQ plot of Top10perc", pch = 20)
qqline(college_data$Top10perc, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Top10perc)
##
## Jarque-Bera Normality Test
##
## data: college_data$Top10perc
## JB = 412.33, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Top10perc)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Top10perc
## kurt = 5.1860, z = 6.5872, p-value = 4.482e-11
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Top10perc, main = "Top10perc boxplot")
Note: The Top10perc variable has a lot of Outliers.
summary(college_data$Top25perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 41.00 54.00 55.77 69.00 100.00
#Test of normality
#1.Histogram
hist(college_data$Top25perc, probability = T, breaks = 15,
main = "Top25perc hist", xlab = "Top25perc")
lines(density(college_data$Top25perc), col = "red")
#2.QQplot
qqnorm(college_data$Top25perc, main = "QQ plot of Top25perc", pch = 20)
qqline(college_data$Top25perc, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Top25perc)
##
## Jarque-Bera Normality Test
##
## data: college_data$Top25perc
## JB = 19.135, p-value = 6.995e-05
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Top25perc)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Top25perc
## kurt = 2.4364, z = -4.4374, p-value = 9.103e-06
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Top25perc, main = "Top25perc boxplot")
Note: The Top25perc variable has no outliers.
summary(college_data$F.Undergrad)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 139 991 1707 3677 3969 31643
#Test of normality
#1.Histogram
hist(college_data$F.Undergrad, probability = T, breaks = 15,
main = "F.Undergrad hist", xlab = "F.Undergrad")
lines(density(college_data$F.Undergrad), col = "red")
#2.QQplot
qqnorm(college_data$F.Undergrad, main = "QQ plot of F.Undergrad", pch = 20)
qqline(college_data$F.Undergrad, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$F.Undergrad)
##
## Jarque-Bera Normality Test
##
## data: college_data$F.Undergrad
## JB = 2863.3, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$F.Undergrad)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$F.Undergrad
## kurt = 10.814, z = 11.275, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$F.Undergrad, main = "F.Undergrad boxplot")
Note: The F.Undergrad variable has a lot of Outliers.
summary(college_data$P.Undergrad)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 95.0 352.5 851.6 964.0 21836.0
#Test of normality
#1.Histogram
hist(college_data$P.Undergrad, probability = T, breaks = 15,
main = "P.Undergrad hist", xlab = "P.Undergrad")
lines(density(college_data$P.Undergrad), col = "red")
#2.QQplot
qqnorm(college_data$P.Undergrad, main = "QQ plot of P.Undergrad", pch = 20)
qqline(college_data$P.Undergrad, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$P.Undergrad)
##
## Jarque-Bera Normality Test
##
## data: college_data$P.Undergrad
## JB = 102622, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$P.Undergrad)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$P.Undergrad
## kurt = 58.165, z = 17.007, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$P.Undergrad, main = "P.Undergrad boxplot")
Note: The P.Undergrad variable has a lot of Outliers.
summary(college_data$Outstate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2340 7305 9990 10445 12931 21700
#Test of normality
#1.Histogram
hist(college_data$Outstate, probability = T, breaks = 15,
main = "Outstate hist", xlab = "Outstate")
lines(density(college_data$Outstate), col = "red")
#2.QQplot
qqnorm(college_data$Outstate, main = "QQ plot of Outstate", pch = 20)
qqline(college_data$Outstate, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Outstate)
##
## Jarque-Bera Normality Test
##
## data: college_data$Outstate
## JB = 38.861, p-value = 3.642e-09
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Outstate)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Outstate
## kurt = 2.5792, z = -2.9480, p-value = 0.003199
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Outstate, main = "Outstate boxplot")
Note: The Outstate variable has one outlier.
summary(college_data$Room.Board)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1780 3596 4198 4357 5050 8124
#Test of normality
#1.Histogram
hist(college_data$Room.Board, probability = T, breaks = 15,
main = "Room.Board hist", xlab = "Room.Board")
lines(density(college_data$Room.Board), col = "red")
#2.QQplot
qqnorm(college_data$Room.Board, main = "QQ plot of Room.Board", pch = 20)
qqline(college_data$Room.Board, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Room.Board)
##
## Jarque-Bera Normality Test
##
## data: college_data$Room.Board
## JB = 30.737, p-value = 2.116e-07
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Room.Board)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Room.Board
## kurt = 2.8041, z = -1.1201, p-value = 0.2627
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Room.Board, main = "Room.Board boxplot")
Note: The Room.Board variable has a lot of Outliers.
summary(college_data$Books)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 96.0 469.5 500.0 549.2 600.0 2340.0
#Test of normality
#1.Histogram
hist(college_data$Books, probability = T, breaks = 15,
main = "Books hist", xlab = "Books", ylim = c(0, 0.005))
lines(density(college_data$Books), col = "red")
#2.QQplot
qqnorm(college_data$Books, main = "QQ plot of Books", pch = 20)
qqline(college_data$Books, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Books)
##
## Jarque-Bera Normality Test
##
## data: college_data$Books
## JB = 27239, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Books)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Books
## kurt = 31.176, z = 15.344, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Books, main = "Books boxplot")
Note: The Books variable has a lot of Outliers.
summary(college_data$Personal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 850 1200 1340 1692 6800
#Test of normality
#1.Histogram
hist(college_data$Personal, probability = T, breaks = 15,
main = "Personal hist", xlab = "Personal")
lines(density(college_data$Personal), col = "red")
#2.QQplot
qqnorm(college_data$Personal, main = "QQ plot of Personal", pch = 20)
qqline(college_data$Personal, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Personal)
##
## Jarque-Bera Normality Test
##
## data: college_data$Personal
## JB = 2018.9, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Personal)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Personal
## kurt = 10.091, z = 10.926, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Personal, main = "Personal boxplot")
Note: The Personal variable has a lot of outliers.
summary(college_data$PhD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 62.00 75.00 72.64 85.00 103.00
Note: The PhD variable exceeds its upper limit of 100.
#Test of normality
#1.Histogram
hist(college_data$PhD, probability = T, breaks = 15,
main = "PhD hist", xlab = "PhD")
lines(density(college_data$PhD), col = "red")
#2.QQplot
qqnorm(college_data$PhD, main = "QQ plot of PhD", pch = 20)
qqline(college_data$PhD, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$PhD)
##
## Jarque-Bera Normality Test
##
## data: college_data$PhD
## JB = 85.651, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$PhD)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$PhD
## kurt = 3.5534, z = 2.6620, p-value = 0.007767
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption. The PhD variable exceeds its upper limit of 100.
boxplot(college_data$PhD, main = "PhD boxplot")
Note: The PhD variable has a lot of outliers.
college_data <- college_data[-which(college_data$PhD > 100), ]
summary(college_data$Terminal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 71.00 82.00 79.67 92.00 100.00
#Test of normality
#1.Histogram
hist(college_data$Terminal, probability = T, breaks = 15,
main = "Terminal hist", xlab = "Terminal")
lines(density(college_data$Terminal), col = "red")
#2.QQplot
qqnorm(college_data$Terminal, main = "QQ plot of Terminal", pch = 20)
qqline(college_data$Terminal, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Terminal)
##
## Jarque-Bera Normality Test
##
## data: college_data$Terminal
## JB = 86.756, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Terminal)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Terminal
## kurt = 3.2277, z = 1.3114, p-value = 0.1897
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Terminal, main = "Terminal boxplot")
Note: The Terminal variable has a lot of Outliers.
summary(college_data$S.F.Ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.50 11.50 13.60 14.08 16.45 39.80
#Test of normality
#1.Histogram
hist(college_data$S.F.Ratio, probability = T, breaks = 15,
main = "S.F.Ratio hist", xlab = "S.F.Ratio")
lines(density(college_data$S.F.Ratio), col = "red")
#2.QQplot
qqnorm(college_data$S.F.Ratio, main = "QQ plot of S.F.Ratio", pch = 20)
qqline(college_data$S.F.Ratio, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$S.F.Ratio)
##
## Jarque-Bera Normality Test
##
## data: college_data$S.F.Ratio
## JB = 270.49, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$S.F.Ratio)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$S.F.Ratio
## kurt = 5.5621, z = 7.1525, p-value = 8.521e-13
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$S.F.Ratio, main = "S.F.Ratio boxplot")
Note: The S.F.Ratio variable has a lot of Outliers.
summary(college_data$perc.alumni)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.00 21.00 22.76 31.00 64.00
#Test of normality
#1.Histogram
hist(college_data$perc.alumni, probability = T, breaks = 15,
main = "perc.alumni hist", xlab = "perc.alumni")
lines(density(college_data$perc.alumni), col = "red")
#2.QQplot
qqnorm(college_data$perc.alumni, main = "QQ plot of perc.alumni", pch = 20)
qqline(college_data$perc.alumni, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$perc.alumni)
##
## Jarque-Bera Normality Test
##
## data: college_data$perc.alumni
## JB = 47.266, p-value = 5.448e-11
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$perc.alumni)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$perc.alumni
## kurt = 2.88878, z = -0.54807, p-value = 0.5836
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$perc.alumni, main = "perc.alumni boxplot")
Note: The perc.alumni variable has a few outliers.
summary(college_data$Expend)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3186 6754 8377 9663 10847 56233
#Test of normality
#1.Histogram
hist(college_data$Expend, probability = T, breaks = 15,
main = "Expend hist", xlab = "Expend")
lines(density(college_data$Expend), col = "red")
#2.QQplot
qqnorm(college_data$Expend, main = "QQ plot of Expend", pch = 20)
qqline(college_data$Expend, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Expend)
##
## Jarque-Bera Normality Test
##
## data: college_data$Expend
## JB = 12711, p-value < 2.2e-16
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Expend)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Expend
## kurt = 21.602, z = 14.144, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
conclusion: reject normality assumption.
boxplot(college_data$Expend, main = "Expend boxplot")
Note: The Expend variable has a lot of outliers.
summary(college_data$Grad.Rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 53.00 65.00 65.48 78.00 118.00
Note: The Grad.Rate variable exceeds its upper limit of 100.
#Test of normality
#1.Histogram
hist(college_data$Grad.Rate, probability = T, breaks = 15,
main = "Grad.Rate hist", xlab = "Grad.Rate")
lines(density(college_data$Grad.Rate), col = "red")
#2.QQplot
qqnorm(college_data$Grad.Rate, main = "QQ plot of Grad.Rate", pch = 20)
qqline(college_data$Grad.Rate, col = "red")
#3.Test Skewness and Kurtosis
#3_1.Jarque_Bera Test (Skewness = 0 ?)
#p_value < 0.05 reject normality assumption.
jarque.test(college_data$Grad.Rate)
##
## Jarque-Bera Normality Test
##
## data: college_data$Grad.Rate
## JB = 3.0568, p-value = 0.2169
## alternative hypothesis: greater
#3_2.Anscomb_Glynn Test(Kurtosis = 3 ?)
#_pvalue < 0.05 reject normality assumption.
anscombe.test(college_data$Grad.Rate)
##
## Anscombe-Glynn kurtosis test
##
## data: college_data$Grad.Rate
## kurt = 2.7931, z = -1.1975, p-value = 0.2311
## alternative hypothesis: kurtosis is not equal to 3
conclusion: fail to reject the normality assumption; the Grad.Rate variable is approximately normal. However, Grad.Rate exceeds its upper limit of 100.
boxplot(college_data$Grad.Rate, main = "Grad.Rate boxplot")
Note: The Grad.Rate variable has a few outliers.
college_data <- college_data[-which(college_data$Grad.Rate > 100), ]
par(mfrow = c(2, 3))
for (i in 2:7) {
hist(college_data[,i], xlab ="", main = colnames(college_data)[i],probability = T)
lines(density(college_data[,i]), col = "red")
}
for (i in 8:13) {
hist(college_data[,i], xlab ="", main = colnames(college_data)[i],probability = T)
lines(density(college_data[,i]), col = "red")
}
for (i in 14:18) {
hist(college_data[,i], xlab ="", main = colnames(college_data)[i],probability = T)
lines(density(college_data[,i]), col = "red")
}
#reset graphical parameters.
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, cex.main = 1)
par(mfrow = c(2, 3), mar = c(2,3, 2, 1.5) )
for (i in 2:7) {
boxplot(college_data[,i], xlab ="", main = colnames(college_data)[i])
}
for (i in 8:13) {
boxplot(college_data[,i], xlab ="", main = colnames(college_data)[i])
}
for (i in 14:18) {
boxplot(college_data[,i], xlab ="", main = colnames(college_data)[i])
}
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1, cex.main = 1)
tapply(college_data$Apps,college_data$Private,mean)
## No Yes
## 5552.952 1974.615
Conclusion: Public (non-private) universities received more applications on average.
#Histogram for all data in one plot
ggplot(college_data, aes(x = Apps, fill = Private)) +
geom_histogram(bins = 25, alpha = 0.5, position = "dodge") +
ggtitle("Distribute the Apps variable based on the Private variable") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
#Histogram for all data in two plot
ggplot(college_data, aes(x = Apps)) +
geom_histogram(bins = 25, fill = "light blue") +
facet_grid(Private ~ .) +
ggtitle("Distribute the Apps variable based on the Private variable") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
#Box plot
boxplot(Apps ~ Private, data = college_data, names = levels(college_data$Private), main = "box plot Apps based on the Private")
conclusion: The distribution of the Apps variable differs between the levels of the Private factor.
#Independent two-group Mann-Whitney U test
wilcox.test(Apps ~ Private, data = college_data, alternative = "two.sided", paired = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Apps by Private
## W = 95738, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Note: p-value < 0.05 indicates that the Private variable has an effect on the Apps variable.
ggplot(college_data, aes(x = Accept, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Accept vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Accept, method = "pearson")
## [1] 0.9360829
#Pearson correlation assumes normality; since normality was rejected above,
#the Spearman rank correlation test is used instead.
# H0 : correlation = 0
# H1 : correlation != 0
# p_value < 0.05 reject H0
cor.test(college_data$Apps, college_data$Accept, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Accept
## S = 1600508, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9792897
conclusion: Apps and Accept have high correlation.
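Rather than testing each predictor against Apps one variable at a time, all the correlations can be computed in a single pass (a sketch over the numeric columns of the cleaned data):

```r
# Pearson and Spearman correlations of Apps with every numeric column
num_cols <- college_data[sapply(college_data, is.numeric)]
round(data.frame(
  pearson  = sapply(num_cols, cor, y = college_data$Apps, method = "pearson"),
  spearman = sapply(num_cols, cor, y = college_data$Apps, method = "spearman")
), 3)
```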
ggplot(college_data, aes(x = Enroll, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Enroll vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Enroll, method = "pearson")
## [1] 0.8750872
cor.test(college_data$Apps, college_data$Enroll, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Enroll
## S = 5714525, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9260549
conclusion: Apps and Enroll have high correlation.
ggplot(college_data, aes(x = Top10perc, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Top10perc vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Top10perc, method = "pearson")
## [1] 0.3656929
cor.test(college_data$Apps, college_data$Top10perc, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Top10perc
## S = 53509557, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3075946
conclusion: Apps and Top10perc have low correlation.
ggplot(college_data, aes(x = Top25perc, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Top25perc vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Top25perc, method = "pearson")
## [1] 0.3685038
cor.test(college_data$Apps, college_data$Top25perc, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Top25perc
## S = 47691767, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3828759
conclusion: Apps and Top25perc have low correlation.
ggplot(college_data, aes(x = F.Undergrad, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("F.Undergrad vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$F.Undergrad, method = "pearson")
## [1] 0.8440157
cor.test(college_data$Apps, college_data$F.Undergrad, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$F.Undergrad
## S = 8761954, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8866217
conclusion: Apps and F.Undergrad have high correlation.
ggplot(college_data, aes(x = P.Undergrad, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("P.Undergrad vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$P.Undergrad, method = "pearson")
## [1] 0.4084307
cor.test(college_data$Apps, college_data$P.Undergrad, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$P.Undergrad
## S = 46579830, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.3972642
conclusion: Apps and P.Undergrad have low correlation.
ggplot(college_data, aes(x = Outstate, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Outstate vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Outstate, method = "pearson")
## [1] 0.06668958
cor.test(college_data$Apps, college_data$Outstate, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Outstate
## S = 73412241, p-value = 0.1642
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.05005693
conclusion: There is almost no correlation between Apps and Outstate.
ggplot(college_data, aes(x = Room.Board, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Room.Board vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Room.Board, method = "pearson")
## [1] 0.1748123
cor.test(college_data$Apps, college_data$Room.Board, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Room.Board
## S = 62758050, p-value = 1.389e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1879205
conclusion: Apps and Room.Board have low correlation.
ggplot(college_data, aes(x = Books, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Books vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Books, method = "pearson")
## [1] 0.1321529
cor.test(college_data$Apps, college_data$Books, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Books
## S = 59497032, p-value = 9.242e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2301176
conclusion: Apps and Books have low correlation.
ggplot(college_data, aes(x = Personal, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Personal vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Personal, method = "pearson")
## [1] 0.1804392
cor.test(college_data$Apps, college_data$Personal, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Personal
## S = 62413139, p-value = 6.887e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1923836
conclusion: Apps and Personal have low correlation.
ggplot(college_data, aes(x = PhD, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("PhD vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$PhD, method = "pearson")
## [1] 0.4192789
cor.test(college_data$Apps, college_data$PhD, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$PhD
## S = 35269517, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5436179
Conclusion: Apps and PhD have moderate correlation (Spearman rho ≈ 0.54).
ggplot(college_data, aes(x = Terminal, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Terminal vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Terminal, method = "pearson")
## [1] 0.3926116
cor.test(college_data$Apps, college_data$Terminal, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Terminal
## S = 38550829, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5011582
Conclusion: Apps and Terminal have moderate correlation (Spearman rho ≈ 0.50).
ggplot(college_data, aes(x = S.F.Ratio, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("S.F.Ratio vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$S.F.Ratio, method = "pearson")
## [1] 0.08356583
cor.test(college_data$Apps, college_data$S.F.Ratio, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$S.F.Ratio
## S = 60976579, p-value = 3.089e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2109725
Conclusion: Apps and S.F.Ratio have a weak but statistically significant correlation (Spearman rho ≈ 0.21).
ggplot(college_data, aes(x = perc.alumni, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("perc.alumni vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$perc.alumni, method = "pearson")
## [1] -0.09481275
cor.test(college_data$Apps, college_data$perc.alumni, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$perc.alumni
## S = 83696416, p-value = 0.02089
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.0830187
conclusion: There is almost no correlation between Apps and perc.alumni.
ggplot(college_data, aes(x = Expend, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Expend vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Expend, method = "pearson")
## [1] 0.283022
cor.test(college_data$Apps, college_data$Expend, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Expend
## S = 62384116, p-value = 6.487e-08
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1927592
conclusion: Apps and Expend have low correlation.
ggplot(college_data, aes(x = Grad.Rate, y = Apps)) +
geom_point() +
geom_smooth(method = "loess", color = "red") +
ggtitle("Grad.Rate vs. Apps") +
theme(plot.title = element_text(hjust =0.5, face = "bold"), axis.line.x = element_line(size = 1, color = "black"),axis.line.y = element_line(size = 1, color = "black"))
## `geom_smooth()` using formula 'y ~ x'
#Correlation check
cor(college_data$Apps, college_data$Grad.Rate, method = "pearson")
## [1] 0.1494663
cor.test(college_data$Apps, college_data$Grad.Rate, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: college_data$Apps and college_data$Grad.Rate
## S = 62708263, p-value = 1.257e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1885648
conclusion: Apps and Grad.Rate have low correlation.
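The pairwise checks above repeat the same pattern for every predictor. As a compact alternative (a sketch, not part of the original analysis; it assumes `college_data` is loaded and treats every numeric column except Apps as a candidate predictor), both correlations can be computed in one pass:

```r
# Sketch: Pearson and Spearman correlations of Apps with every numeric
# predictor, sorted by absolute Spearman rho (strongest first).
num_vars <- setdiff(names(college_data)[sapply(college_data, is.numeric)], "Apps")
cor_summary <- t(sapply(num_vars, function(v) {
  c(pearson  = cor(college_data$Apps, college_data[[v]], method = "pearson"),
    spearman = cor(college_data$Apps, college_data[[v]], method = "spearman"))
}))
round(cor_summary[order(-abs(cor_summary[, "spearman"])), ], 2)
```

This reproduces the individual `cor()` calls above in a single table, which also makes it harder to miss a predictor.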
cor_table <- round(cor(college_data[,2:18]),2)
cor_table
## Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## Apps 1.00 0.94 0.88 0.37 0.37 0.84 0.41
## Accept 0.94 1.00 0.93 0.20 0.25 0.89 0.45
## Enroll 0.88 0.93 1.00 0.18 0.22 0.96 0.51
## Top10perc 0.37 0.20 0.18 1.00 0.89 0.14 -0.11
## Top25perc 0.37 0.25 0.22 0.89 1.00 0.19 -0.06
## F.Undergrad 0.84 0.89 0.96 0.14 0.19 1.00 0.57
## P.Undergrad 0.41 0.45 0.51 -0.11 -0.06 0.57 1.00
## Outstate 0.07 -0.02 -0.15 0.56 0.49 -0.22 -0.25
## Room.Board 0.17 0.09 -0.04 0.37 0.33 -0.07 -0.06
## Books 0.13 0.11 0.11 0.12 0.12 0.11 0.08
## Personal 0.18 0.20 0.28 -0.10 -0.08 0.31 0.32
## PhD 0.42 0.37 0.33 0.53 0.55 0.32 0.15
## Terminal 0.39 0.35 0.31 0.49 0.52 0.30 0.14
## S.F.Ratio 0.08 0.17 0.23 -0.39 -0.30 0.28 0.23
## perc.alumni -0.09 -0.17 -0.18 0.46 0.42 -0.23 -0.28
## Expend 0.28 0.13 0.06 0.66 0.53 0.02 -0.08
## Grad.Rate 0.15 0.06 -0.03 0.50 0.48 -0.08 -0.26
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## Apps 0.07 0.17 0.13 0.18 0.42 0.39 0.08
## Accept -0.02 0.09 0.11 0.20 0.37 0.35 0.17
## Enroll -0.15 -0.04 0.11 0.28 0.33 0.31 0.23
## Top10perc 0.56 0.37 0.12 -0.10 0.53 0.49 -0.39
## Top25perc 0.49 0.33 0.12 -0.08 0.55 0.52 -0.30
## F.Undergrad -0.22 -0.07 0.11 0.31 0.32 0.30 0.28
## P.Undergrad -0.25 -0.06 0.08 0.32 0.15 0.14 0.23
## Outstate 1.00 0.65 0.04 -0.30 0.39 0.41 -0.55
## Room.Board 0.65 1.00 0.13 -0.20 0.34 0.38 -0.36
## Books 0.04 0.13 1.00 0.18 0.03 0.10 -0.03
## Personal -0.30 -0.20 0.18 1.00 -0.01 -0.03 0.14
## PhD 0.39 0.34 0.03 -0.01 1.00 0.85 -0.14
## Terminal 0.41 0.38 0.10 -0.03 0.85 1.00 -0.16
## S.F.Ratio -0.55 -0.36 -0.03 0.14 -0.14 -0.16 1.00
## perc.alumni 0.57 0.27 -0.04 -0.29 0.25 0.27 -0.40
## Expend 0.67 0.50 0.11 -0.10 0.44 0.44 -0.58
## Grad.Rate 0.58 0.42 0.00 -0.27 0.32 0.30 -0.31
## perc.alumni Expend Grad.Rate
## Apps -0.09 0.28 0.15
## Accept -0.17 0.13 0.06
## Enroll -0.18 0.06 -0.03
## Top10perc 0.46 0.66 0.50
## Top25perc 0.42 0.53 0.48
## F.Undergrad -0.23 0.02 -0.08
## P.Undergrad -0.28 -0.08 -0.26
## Outstate 0.57 0.67 0.58
## Room.Board 0.27 0.50 0.42
## Books -0.04 0.11 0.00
## Personal -0.29 -0.10 -0.27
## PhD 0.25 0.44 0.32
## Terminal 0.27 0.44 0.30
## S.F.Ratio -0.40 -0.58 -0.31
## perc.alumni 1.00 0.42 0.49
## Expend 0.42 1.00 0.39
## Grad.Rate 0.49 0.39 1.00
Note: The variable Apps correlates most strongly with Accept, Enroll, and F.Undergrad, and least with Outstate and perc.alumni. High correlations also appear among the independent variables themselves, which hints at multicollinearity.
par(mfrow = c(3, 3), mar = c(2, 2, 2, 2))
for (i in 3:11) {
plot(college_data[, i], college_data$Apps, xlab = "", main = paste("Apps vs", names(college_data)[i]))
}
for (i in 12:18) {
plot(college_data[, i], college_data$Apps, xlab = "", main = paste("Apps vs", names(college_data)[i]))
}
par(mfrow = c(1, 1))
Note: The Apps variable is linearly related to some variables and non-linearly related to others.
Most variables do not follow a normal distribution, which is not a concern: linear regression does not require the variables themselves to be normal, only the residuals.
Questions of Data understanding
Answer: The questions above were answered in section A. In more detail:
Variable A was somewhat ambiguous and was not examined closely.
Other variables that could help solve the problem:
These two steps overlap, so let’s examine them together.
Questions of Data preparation
Answer: A number of outdated records were deleted, and the data were transformed and reported.
sum(is.na(college_data))
## [1] 0
Note: Fortunately, our dataset contains no missing values.
We split the dataset into two parts, train and test: the model is built on the train set and evaluated on the test set.
#Divide Dataset into Train and Test
set.seed(12)
train_cases <- sample(1:nrow(college_data), nrow(college_data) * 0.7)
train <- college_data[train_cases, ]
test <- college_data[-train_cases, ]
dim(train)
## [1] 541 18
dim(test)
## [1] 233 18
summary(train)
## Private Apps Accept Enroll Top10perc
## No :148 Min. : 100 Min. : 90 Min. : 35 Min. : 1.00
## Yes:393 1st Qu.: 776 1st Qu.: 595 1st Qu.: 242 1st Qu.:15.00
## Median : 1616 Median : 1159 Median : 438 Median :24.00
## Mean : 3099 Mean : 2075 Mean : 810 Mean :28.25
## 3rd Qu.: 3877 3rd Qu.: 2603 3rd Qu.: 944 3rd Qu.:36.00
## Max. :20192 Max. :15096 Max. :6392 Max. :95.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.00 Min. : 282 Min. : 1.0 Min. : 2580
## 1st Qu.: 40.00 1st Qu.: 1022 1st Qu.: 95.0 1st Qu.: 7230
## Median : 54.00 Median : 1757 Median : 353.0 Median : 9950
## Mean : 56.21 Mean : 3830 Mean : 850.2 Mean :10423
## 3rd Qu.: 70.00 3rd Qu.: 4623 3rd Qu.: 982.0 3rd Qu.:12950
## Max. :100.00 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1880 Min. : 110.0 Min. : 300 Min. : 8.00
## 1st Qu.:3579 1st Qu.: 468.0 1st Qu.: 900 1st Qu.: 62.00
## Median :4180 Median : 500.0 Median :1200 Median : 75.00
## Mean :4359 Mean : 553.2 Mean :1351 Mean : 72.35
## 3rd Qu.:5062 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :4913 Max. :100.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.00 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 70.00 1st Qu.:11.50 1st Qu.:14.00 1st Qu.: 6786
## Median : 81.00 Median :13.70 Median :21.00 Median : 8328
## Mean : 79.42 Mean :14.15 Mean :22.97 Mean : 9617
## 3rd Qu.: 92.00 3rd Qu.:16.60 3rd Qu.:31.00 3rd Qu.:10864
## Max. :100.00 Max. :39.80 Max. :64.00 Max. :45702
## Grad.Rate
## Min. : 10.0
## 1st Qu.: 54.0
## Median : 65.0
## Mean : 65.9
## 3rd Qu.: 78.0
## Max. :100.0
summary(test)
## Private Apps Accept Enroll Top10perc
## No : 62 Min. : 81 Min. : 72 Min. : 46.0 Min. : 1.00
## Yes:171 1st Qu.: 785 1st Qu.: 632 1st Qu.: 242.0 1st Qu.:15.00
## Median : 1444 Median : 1086 Median : 417.0 Median :22.00
## Mean : 2588 Mean : 1785 Mean : 697.5 Mean :26.02
## 3rd Qu.: 3073 3rd Qu.: 2042 3rd Qu.: 776.0 3rd Qu.:31.00
## Max. :21804 Max. :18744 Max. :5874.0 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 13.00 Min. : 139 Min. : 3.0 Min. : 2340
## 1st Qu.: 43.00 1st Qu.: 943 1st Qu.: 95.0 1st Qu.: 7560
## Median : 54.00 Median : 1656 Median : 355.0 Median :10217
## Mean : 54.87 Mean : 3345 Mean : 861.5 Mean :10523
## 3rd Qu.: 66.00 3rd Qu.: 3499 3rd Qu.: 871.0 3rd Qu.:12850
## Max. :100.00 Max. :30017 Max. :10962.0 Max. :20100
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 10.0
## 1st Qu.:3620 1st Qu.: 470.0 1st Qu.: 800 1st Qu.: 62.0
## Median :4270 Median : 500.0 Median :1200 Median : 76.0
## Mean :4357 Mean : 539.5 Mean :1321 Mean : 73.4
## 3rd Qu.:5045 3rd Qu.: 600.0 3rd Qu.:1675 3rd Qu.: 87.0
## Max. :6950 Max. :1400.0 Max. :6800 Max. :100.0
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 33.00 Min. : 3.30 Min. : 0.00 Min. : 3480
## 1st Qu.: 72.00 1st Qu.:11.40 1st Qu.:11.00 1st Qu.: 6744
## Median : 85.00 Median :13.50 Median :20.00 Median : 8539
## Mean : 80.39 Mean :13.92 Mean :22.27 Mean : 9778
## 3rd Qu.: 91.00 3rd Qu.:15.90 3rd Qu.:31.00 3rd Qu.:10779
## Max. :100.00 Max. :27.80 Max. :60.00 Max. :56233
## Grad.Rate
## Min. : 15.00
## 1st Qu.: 52.00
## Median : 65.00
## Mean : 64.26
## 3rd Qu.: 77.00
## Max. :100.00
Note: The train and test distributions are similar, which is good: the test set is representative of the training data.
m1 <- lm(Apps ~ . , data = train)
summary(m1)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3854.9 -437.0 -83.7 287.1 6268.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -681.06298 494.40443 -1.378 0.168934
## PrivateYes -523.96480 162.09226 -3.233 0.001304 **
## Accept 1.29079 0.05975 21.603 < 2e-16 ***
## Enroll -0.22694 0.20714 -1.096 0.273744
## Top10perc 46.83186 6.74940 6.939 1.17e-11 ***
## Top25perc -15.04953 5.32878 -2.824 0.004921 **
## F.Undergrad 0.05390 0.03601 1.497 0.135030
## P.Undergrad 0.05653 0.03577 1.580 0.114609
## Outstate -0.06529 0.02302 -2.836 0.004742 **
## Room.Board 0.13355 0.05485 2.435 0.015237 *
## Books -0.15121 0.25937 -0.583 0.560143
## Personal -0.01000 0.07752 -0.129 0.897354
## PhD -10.65127 5.46434 -1.949 0.051802 .
## Terminal -1.93004 5.86554 -0.329 0.742252
## S.F.Ratio 17.91148 15.11256 1.185 0.236474
## perc.alumni -4.07861 4.77966 -0.853 0.393869
## Expend 0.10921 0.01680 6.500 1.88e-10 ***
## Grad.Rate 12.81339 3.62074 3.539 0.000438 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 999 on 523 degrees of freedom
## Multiple R-squared: 0.9263, Adjusted R-squared: 0.9239
## F-statistic: 386.5 on 17 and 523 DF, p-value: < 2.2e-16
Note: Model summary results.
Conclusion: Variables whose coefficients are not significant in the t-test are removed from the model.
m1_1 <- lm(Apps ~ Private + Accept + Top10perc + Outstate + Room.Board + Expend + Grad.Rate , data = train)
summary(m1_1)
##
## Call:
## lm(formula = Apps ~ Private + Accept + Top10perc + Outstate +
## Room.Board + Expend + Grad.Rate, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3929.4 -472.8 -48.6 304.7 6415.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.283e+03 2.185e+02 -5.874 7.50e-09 ***
## PrivateYes -4.630e+02 1.478e+02 -3.132 0.00183 **
## Accept 1.303e+00 2.374e-02 54.878 < 2e-16 ***
## Top10perc 2.705e+01 3.700e+00 7.311 9.75e-13 ***
## Outstate -9.101e-02 2.159e-02 -4.216 2.92e-05 ***
## Room.Board 1.325e-01 5.290e-02 2.504 0.01256 *
## Expend 1.072e-01 1.493e-02 7.176 2.42e-12 ***
## Grad.Rate 8.989e+00 3.417e+00 2.631 0.00876 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1016 on 533 degrees of freedom
## Multiple R-squared: 0.9223, Adjusted R-squared: 0.9213
## F-statistic: 903.9 on 7 and 533 DF, p-value: < 2.2e-16
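The manual reduction above can also be automated. As an alternative (a sketch, not part of the original analysis), backward elimination by AIC with `step()` starts from the full model `m1` and drops terms one at a time:

```r
# Sketch: backward elimination by AIC, starting from the full model m1.
# An automated alternative to removing non-significant terms by hand;
# the retained predictors may differ slightly from the manual m1_1.
m1_step <- step(m1, direction = "backward", trace = FALSE)
summary(m1_step)
```

AIC-based selection and t-test-based pruning usually agree on the strongest predictors but can disagree on borderline ones, so comparing both reduced models is a reasonable check.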
Note: By these measures the model looks good, but the assumptions of the regression model must also be checked.
We can generalize the above results to the population only when the following assumptions hold.
Regression has two main assumptions:
#Check Assumption of Regression
##Normality of residuals
###Histogram
hist(m1_1$residuals, probability = T)
lines(density(m1_1$residuals), col = "red")
###QQplot
qqnorm(m1_1$residuals , main = "QQ_plot of residuals", pch = 20)
qqline(m1_1$residuals, col = "red")
###Test Skewness and Kurtosis
moments :: jarque.test(m1_1$residuals)
##
## Jarque-Bera Normality Test
##
## data: m1_1$residuals
## JB = 3442.5, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(m1_1$residuals)
##
## Anscombe-Glynn kurtosis test
##
## data: m1_1$residuals
## kurt = 14.538, z = 10.741, p-value < 2.2e-16
## alternative hypothesis: kurtosis is not equal to 3
Conclusion: The residuals are not normally distributed, so the first assumption is rejected.
#Diagnostic plots
plot(m1_1)
Note: The diagnostic plots show non-normal residuals and a non-constant spread of residuals against the fitted values.
Conclusion: There is a problem of heteroscedasticity.
#Check Multicollinearity
car :: vif(m1)
## Private Accept Enroll Top10perc Top25perc F.Undergrad
## 2.830423 10.642178 21.103948 7.983988 6.282830 16.880319
## P.Undergrad Outstate Room.Board Books Personal PhD
## 1.658309 4.671284 2.082366 1.098323 1.336526 4.344837
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 4.207430 2.005022 1.868395 3.728812 2.003254
car :: vif(m1_1)
## Private Accept Top10perc Outstate Room.Board Expend Grad.Rate
## 2.276805 1.624296 2.320346 3.972856 1.873046 2.849304 1.725085
Note: A common rule of thumb is that VIF values below 10 indicate no serious multicollinearity problem.
Conclusion: In the full model m1 several VIFs exceed 10 (Accept, Enroll, F.Undergrad), but the reduced model m1_1 has no multicollinearity problem, so its t-test results can be trusted.
However, this model violates the regression assumptions, so it appears to be a poor model.
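The VIF rule of thumb follows from its definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A minimal sketch (our own check, reproducing `car::vif` by hand for one predictor of m1_1):

```r
# Sketch: compute the VIF of Accept manually and compare with car::vif.
# R_j^2 is the R-squared of Accept regressed on the other predictors of m1_1.
r2_accept <- summary(lm(Accept ~ Private + Top10perc + Outstate +
                          Room.Board + Expend + Grad.Rate,
                        data = train))$r.squared
1 / (1 - r2_accept)  # should match car::vif(m1_1)["Accept"], about 1.62
```

A VIF of 10 therefore corresponds to R_j² = 0.9, i.e. 90% of a predictor's variance being explained by the other predictors.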
#prediction
pred_m1_1 <- predict(m1_1, test)
#Absolute error
abs_err_m1_1 <- abs(pred_m1_1 - test$Apps)
summary(abs_err_m1_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.099 199.481 390.507 596.973 709.045 6972.844
sd(abs_err_m1_1)
## [1] 773.7654
#Histogram and Boxplot
hist(abs_err_m1_1)
boxplot(abs_err_m1_1)
df_err <- data.frame("m1_1" = abs_err_m1_1 )
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.098694
## max.of.Abs.Error
## m1_1 6972.844
Note: We create a data frame so that models can be compared on the same error statistics.
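Since the same five statistics are recomputed for every model, a small helper (a sketch; the function name `err_summary` is our own) keeps the comparison table consistent:

```r
# Sketch: summarize a vector of absolute errors with the same five
# statistics used in models_comp (helper name is our own choice).
err_summary <- function(abs_err) {
  c(mean = mean(abs_err), median = median(abs_err),
    sd = sd(abs_err), min = min(abs_err), max = max(abs_err))
}
t(sapply(list(m1_1 = abs_err_m1_1), err_summary))
```

Adding a later model to the comparison is then a one-line change to the list passed to `sapply`.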
When the data are skewed, we transform the response variable so that its distribution is closer to normal.
#find best lambda for boxcox transformation
library("MASS")
box_results <- boxcox(Apps ~ ., data = train, lambda = seq(-5, 5, 0.1))
box_results <- data.frame(box_results$x, box_results$y)
lambda <- box_results[which(box_results$box_results.y == max(box_results$box_results.y)), 1]
lambda
## [1] 0.5
#create boxcox_Apps variable
train$boxcox_Apps <- (train$Apps ^ 0.5 - 1) / 0.5
#check normality
##Histogram
hist(train$boxcox_Apps, probability = TRUE, breaks = 15)
lines(density(train$boxcox_Apps), col = "red")
##QQplot
qqnorm(train$boxcox_Apps, main = "QQ_plot", pch = 20)
qqline(train$boxcox_Apps, col = "red")
##Test for skewness and kurtosis
jarque.test(train$boxcox_Apps)
##
## Jarque-Bera Normality Test
##
## data: train$boxcox_Apps
## JB = 125.36, p-value < 2.2e-16
## alternative hypothesis: greater
anscombe.test(train$boxcox_Apps)
##
## Anscombe-Glynn kurtosis test
##
## data: train$boxcox_Apps
## kurt = 3.6693, z = 2.6295, p-value = 0.008551
## alternative hypothesis: kurtosis is not equal to 3
Conclusion: The skewness was too strong; even the Box-Cox transformation could not bring the distribution to normality.
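For lambda = 0.5 the Box-Cox transform and its inverse form an exact round trip. Writing both as functions (a sketch; the original code inlines the formulas) makes the back-transformation at prediction time less error-prone:

```r
# Sketch: forward and inverse Box-Cox for a fixed lambda (here 0.5).
# Forward: y = (x^lambda - 1) / lambda; inverse: x = (y*lambda + 1)^(1/lambda).
boxcox_fwd <- function(x, lambda) (x ^ lambda - 1) / lambda
boxcox_inv <- function(y, lambda) (y * lambda + 1) ^ (1 / lambda)
# Round trip recovers the original Apps values.
all.equal(boxcox_inv(boxcox_fwd(train$Apps, 0.5), 0.5), train$Apps)  # TRUE
```

Predictions made on the transformed scale must pass through `boxcox_inv` before being compared with the raw Apps values.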
m2 <- lm(boxcox_Apps ~ . - Apps, data = train)
summary(m2)
##
## Call:
## lm(formula = boxcox_Apps ~ . - Apps, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.741 -9.757 -0.242 8.710 64.487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.313e+01 8.324e+00 -1.577 0.11536
## PrivateYes -1.668e+01 2.729e+00 -6.111 1.93e-09 ***
## Accept 1.779e-02 1.006e-03 17.679 < 2e-16 ***
## Enroll 7.607e-04 3.488e-03 0.218 0.82741
## Top10perc 4.770e-01 1.136e-01 4.198 3.17e-05 ***
## Top25perc -8.633e-02 8.972e-02 -0.962 0.33640
## F.Undergrad -6.522e-04 6.063e-04 -1.076 0.28252
## P.Undergrad 1.728e-03 6.022e-04 2.869 0.00428 **
## Outstate -2.256e-04 3.876e-04 -0.582 0.56078
## Room.Board 2.515e-03 9.236e-04 2.723 0.00669 **
## Books 6.401e-03 4.367e-03 1.466 0.14331
## Personal 1.290e-03 1.305e-03 0.988 0.32357
## PhD 4.105e-02 9.200e-02 0.446 0.65567
## Terminal 1.073e-01 9.876e-02 1.087 0.27763
## S.F.Ratio 1.218e+00 2.545e-01 4.787 2.21e-06 ***
## perc.alumni -1.147e-01 8.048e-02 -1.426 0.15456
## Expend 1.760e-03 2.829e-04 6.222 1.01e-09 ***
## Grad.Rate 2.750e-01 6.096e-02 4.511 7.96e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.82 on 523 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9061
## F-statistic: 307.5 on 17 and 523 DF, p-value: < 2.2e-16
Conclusion: Variables whose coefficients are not significant in the t-test are removed from the model.
m2_1 <- lm(boxcox_Apps ~ Private + Accept + Top10perc + P.Undergrad + Room.Board + S.F.Ratio + Expend + Grad.Rate , data = train)
summary(m2_1)
##
## Call:
## lm(formula = boxcox_Apps ~ Private + Accept + Top10perc + P.Undergrad +
## Room.Board + S.F.Ratio + Expend + Grad.Rate, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.146 -9.852 -0.214 8.722 66.601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.012e+00 6.259e+00 -0.481 0.630533
## PrivateYes -1.847e+01 2.336e+00 -7.908 1.52e-14 ***
## Accept 1.719e-02 4.135e-04 41.576 < 2e-16 ***
## Top10perc 3.956e-01 6.113e-02 6.473 2.19e-10 ***
## P.Undergrad 1.658e-03 5.612e-04 2.955 0.003262 **
## Room.Board 2.924e-03 8.247e-04 3.545 0.000426 ***
## S.F.Ratio 1.240e+00 2.518e-01 4.924 1.14e-06 ***
## Expend 1.840e-03 2.564e-04 7.176 2.42e-12 ***
## Grad.Rate 2.291e-01 5.570e-02 4.113 4.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.87 on 532 degrees of freedom
## Multiple R-squared: 0.907, Adjusted R-squared: 0.9056
## F-statistic: 648.3 on 8 and 532 DF, p-value: < 2.2e-16
#Check Assumptions of Regression
plot(m2_1)
Conclusion: The residuals are not normally distributed, so the first assumption is again rejected, and there is still a problem of heteroscedasticity.
#Check Multicollinearity
car :: vif(m2_1)
## Private Accept Top10perc P.Undergrad Room.Board S.F.Ratio
## 2.061826 1.787393 2.296774 1.431806 1.650965 1.952909
## Expend Grad.Rate
## 3.045916 1.662579
Conclusion: The model does not have a multicollinearity problem, so the t-test results can be trusted.
#prediction
pred_m2_1 <- predict(m2_1, test)
pred_m2_1 <- (pred_m2_1 * 0.5 + 1) ^ 2 # inverse Box-Cox for lambda = 0.5
#Absolute error
abs_err_m2_1 <- abs(pred_m2_1 - test$Apps)
summary(abs_err_m2_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44.96 2635.42 4533.46 7667.82 8234.31 141274.20
sd(abs_err_m2_1)
## [1] 12179.97
#Histogram and Boxplot
hist(abs_err_m2_1)
boxplot(abs_err_m2_1)
Conclusion: The Box-Cox transformation made the residual situation worse, and this model performs poorly.
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m2_1" = abs_err_m2_1)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.098694
## m2_1 7667.8177 4533.4558 12179.9685 44.958069
## max.of.Abs.Error
## m1_1 6972.844
## m2_1 141274.196
Conclusion: Model 1 is clearly better than Model 2.
head(train)
## Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 451 Yes 692 576 174 19 50 597 83
## 347 Yes 499 441 199 26 52 846 377
## 337 Yes 874 758 428 21 46 1605 246
## 248 Yes 467 424 350 16 40 1365 334
## 175 Yes 13789 3893 1583 90 98 6188 53
## 454 Yes 1133 630 220 37 73 750 30
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni
## 451 10500 3860 600 940 58 64 11.6 19
## 347 11200 7400 600 1300 66 79 6.8 50
## 337 9858 3700 450 1200 42 45 17.6 16
## 248 6300 2980 700 2140 75 79 13.7 10
## 175 18590 5950 625 1162 95 96 5.0 44
## 454 17688 5900 650 850 100 100 10.4 11
## Expend Grad.Rate boxcox_Apps
## 451 8990 39 50.61179
## 347 10819 90 42.67662
## 337 4796 55 57.12698
## 248 7054 38 41.22037
## 175 27206 97 232.85315
## 454 14820 73 65.32013
train <- train[,-19]
We distribute the train observations evenly across k folds. There are 17 candidate models (one for each subset size), so we create a matrix that stores each model’s error on each fold.
library("leaps")
k <- 10
set.seed(123)
folds <- sample(1:k, nrow(train), rep = TRUE)
cv_errors <- matrix(NA, k, 17, dimnames = list(NULL, paste(1:17)))
cv_errors
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [1,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [7,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [8,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [9,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
predict_regsubsets <- function(object, newdata, id){
reg_formula <- as.formula(object$call[[2]])
mat <- model.matrix(reg_formula, newdata)
coef_i <- coef(object, id = id)
mat[, names(coef_i)] %*% coef_i
}
for (i in 1:k) {
best_fit <- regsubsets(Apps ~ ., data = train[folds != i, ], nvmax = 17)
for ( j in 1:17) {
pred <- predict_regsubsets(best_fit, newdata = train[folds == i,], id = j)
cv_errors[i, j] <- mean((train$Apps[folds == i] - pred) ^ 2)
}
}
cv_errors
## 1 2 3 4 5 6 7
## [1,] 1147040.2 734013.3 622946.9 664345.8 682101.5 656489.3 723483.1
## [2,] 711738.7 729031.0 668876.8 645100.8 583021.2 570368.5 611519.2
## [3,] 1726947.5 1189231.2 1265115.0 1384921.8 1353493.5 1410367.1 1405397.3
## [4,] 976642.2 944835.8 929006.1 890367.3 935175.4 992716.3 945305.6
## [5,] 491653.6 605304.1 592482.7 556010.3 560040.3 547570.7 539480.4
## [6,] 3078756.3 1821222.4 1701720.1 1617094.8 1535879.3 1544437.9 1485121.0
## [7,] 1491829.2 941344.3 944381.4 663933.4 702487.8 679105.3 663658.7
## [8,] 2375703.4 1653448.9 1700841.8 1646361.3 1682634.6 1682288.6 1625430.3
## [9,] 2822617.7 1834083.7 1656356.6 1630456.6 1640019.6 1632766.4 1620522.9
## [10,] 1484201.6 1256031.3 1282608.3 1218994.2 1225445.2 1204905.1 1189939.5
## 8 9 10 11 12 13 14
## [1,] 681032.1 673576.3 645447.8 657104.1 668861.8 668284.0 661776.0
## [2,] 578855.8 562382.3 553115.2 548426.2 557424.8 552990.5 561225.4
## [3,] 1390953.1 1335829.7 1313677.2 1341407.8 1322889.0 1309580.7 1312246.5
## [4,] 983446.9 919248.6 961848.3 964068.8 946920.9 942806.7 1037027.8
## [5,] 514616.3 469771.1 498348.6 509011.6 524459.0 511355.5 518352.7
## [6,] 1445470.8 1398775.5 1447689.9 1502249.5 1487995.1 1493032.8 1485943.9
## [7,] 659452.2 586361.8 587831.8 588993.1 586853.2 575549.6 573131.6
## [8,] 1597411.4 1513743.1 1543913.4 1592400.4 1596843.6 1635225.7 1630923.6
## [9,] 1613383.9 1552481.4 1523411.7 1558764.7 1542918.2 1530932.2 1540417.4
## [10,] 1218153.7 1194336.2 1143673.1 1157847.6 1158205.7 1155268.8 1163186.7
## 15 16 17
## [1,] 650638.6 655356.8 655963.9
## [2,] 556175.6 557230.3 556015.5
## [3,] 1321141.9 1327150.8 1327556.4
## [4,] 1032288.5 1036177.6 1035327.7
## [5,] 492829.8 491480.0 491881.2
## [6,] 1482149.3 1484401.1 1486444.4
## [7,] 578544.4 578259.8 575516.3
## [8,] 1627024.7 1632870.7 1631921.2
## [9,] 1535589.5 1534823.3 1539092.6
## [10,] 1160835.7 1164056.5 1167395.5
Note: For each subset size, a model was fit on each of the 10 folds and its cross-validation error recorded.
mean_cv_errors <- apply(cv_errors, 2, mean)
mean_cv_errors
## 1 2 3 4 5 6 7 8 9 10
## 1630713 1170855 1136434 1091759 1090030 1092102 1080986 1068278 1020651 1021896
## 11 12 13 14 15 16 17
## 1042027 1039337 1037503 1048423 1043722 1046181 1046711
plot(mean_cv_errors, type = "b")
which.min(mean_cv_errors)
## 9
## 9
Note: We average each column to find the best subset size. The model with 9 variables has the lowest mean CV error.
m3 <- regsubsets(Apps ~ ., data = train, nvmax = 17)
coef(m3,9)
## (Intercept) PrivateYes Accept Top10perc Top25perc
## -386.77781410 -630.44356361 1.31896484 45.36497501 -14.46736061
## Outstate Room.Board PhD Expend Grad.Rate
## -0.07683097 0.15147886 -10.84808271 0.10042862 10.59897483
Conclusion: We build a new model with these nine variables.
#Model making
m3_1 <- lm(Apps ~ Private + Accept + Top10perc + Top25perc + Outstate + Room.Board + PhD + Expend +Grad.Rate , data = train)
summary(m3_1)
##
## Call:
## lm(formula = Apps ~ Private + Accept + Top10perc + Top25perc +
## Outstate + Room.Board + PhD + Expend + Grad.Rate, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3753.8 -436.5 -58.5 293.0 6119.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -386.77781 298.75728 -1.295 0.196013
## PrivateYes -630.44356 153.44125 -4.109 4.61e-05 ***
## Accept 1.31896 0.02367 55.733 < 2e-16 ***
## Top10perc 45.36498 6.60507 6.868 1.82e-11 ***
## Top25perc -14.46736 5.21210 -2.776 0.005702 **
## Outstate -0.07683 0.02151 -3.572 0.000386 ***
## Room.Board 0.15148 0.05235 2.894 0.003963 **
## PhD -10.84808 3.58658 -3.025 0.002610 **
## Expend 0.10043 0.01523 6.595 1.03e-10 ***
## Grad.Rate 10.59897 3.38373 3.132 0.001830 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1000 on 531 degrees of freedom
## Multiple R-squared: 0.925, Adjusted R-squared: 0.9237
## F-statistic: 727.6 on 9 and 531 DF, p-value: < 2.2e-16
#Check Assumptions of Regression
plot(m3_1)
Note: The diagnostic plots show heteroscedasticity (non-constant residual variance).
#Prediction
pred_m3_1 <- predict(m3_1, test)
#Absolute error
abs_err_m3_1 <- abs(pred_m3_1 - test$Apps)
summary(abs_err_m3_1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.237 171.543 398.241 596.230 692.421 6604.283
sd(abs_err_m3_1)
## [1] 755.0726
#Histogram and Boxplot
hist(abs_err_m3_1)
boxplot(abs_err_m3_1)
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m2_1" = abs_err_m2_1,
"m3_1" = abs_err_m3_1)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.0986942
## m2_1 7667.8177 4533.4558 12179.9685 44.9580688
## m3_1 596.2298 398.2408 755.0726 0.2370317
## max.of.Abs.Error
## m1_1 6972.844
## m2_1 141274.196
## m3_1 6604.283
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")
Conclusion: Models 1 and 3 perform similarly, while Model 2 is clearly inferior.
Regularization
Ridge Regression:
The goal is to minimize:
RSS + lambda * sum(beta_i ^ 2)
lambda >= 0, a tuning parameter
Note: glmnet() requires a numeric design matrix, so we build one with model.matrix().
x <- model.matrix(Apps ~ . , data = train)[, -1] #Remove intercepts
y <- train$Apps
#Apply ridge regression
library("glmnet")
## Loading required package: Matrix
## Loaded glmnet 4.0-2
m4_ridge <- glmnet(x, y, alpha = 0)
dim(coef(m4_ridge))
## [1] 18 100
Note: glmnet() generates its own sequence of lambda values; here there are 100 of them.
If alpha = 0, glmnet() fits a ridge regression model.
If alpha = 1, it fits a lasso model.
plot(m4_ridge, xvar = "lambda")
Note: The larger the lambda, the closer the regression coefficients shrink toward 0, though they never reach exactly 0.
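The shrinkage effect can be seen by comparing coefficients at a small and a large lambda. A minimal self-contained sketch on the built-in mtcars data (assuming glmnet is installed); this is an illustration, not part of the college analysis:

```r
library("glmnet")

x_demo <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # numeric design matrix
y_demo <- mtcars$mpg

ridge_demo <- glmnet(x_demo, y_demo, alpha = 0)  # alpha = 0 -> ridge

# Total coefficient magnitude shrinks as lambda grows,
# but every coefficient stays (slightly) nonzero
sum(abs(coef(ridge_demo, s = 0.01)[-1]))  # large
sum(abs(coef(ridge_demo, s = 1000)[-1])) # much smaller, still all nonzero
```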
ridge_cv <- cv.glmnet(x, y, alpha = 0)
best_lambda <- ridge_cv$lambda.min
best_lambda
## [1] 338.1328
Conclusion: the best lambda is 338.13.
ridge_model <- glmnet(x, y, alpha = 0, lambda = best_lambda)
coef(ridge_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -1.616039e+03
## PrivateYes -5.476519e+02
## Accept 7.684774e-01
## Enroll 6.658907e-01
## Top10perc 2.558789e+01
## Top25perc -1.338570e+00
## F.Undergrad 1.075362e-01
## P.Undergrad 2.563345e-02
## Outstate -1.213134e-03
## Room.Board 1.848040e-01
## Books -1.586445e-02
## Personal -2.188587e-02
## PhD -3.418311e+00
## Terminal -3.757458e+00
## S.F.Ratio 1.245465e+01
## perc.alumni -1.238471e+01
## Expend 9.836593e-02
## Grad.Rate 1.292104e+01
Note: These are the regression coefficients at the best lambda.
x_test <- model.matrix(Apps ~ . , data = test)[, -1]
pred_m4_ridge <- predict(m4_ridge, s = best_lambda, newx = x_test)
pred_m4_ridge <- c(pred_m4_ridge)
abs_err_m4_ridge <- abs(pred_m4_ridge - test$Apps)
summary(abs_err_m4_ridge)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.108 140.511 405.389 637.808 822.121 6530.076
sd(abs_err_m4_ridge)
## [1] 814.0771
#Histogram and Boxplot
hist(abs_err_m4_ridge)
boxplot(abs_err_m4_ridge)
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridge" = abs_err_m4_ridge)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.0986942
## m3_1 596.2298 398.2408 755.0726 0.2370317
## m4_ridge 637.8083 405.3886 814.0771 5.1078962
## max.of.Abs.Error
## m1_1 6972.844
## m3_1 6604.283
## m4_ridge 6530.076
Conclusion: The new model is not better than the previous two models.
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")
Conclusion: the m1_1 and m3_1 models remain the best.
Regularization
Lasso Regression:
The goal is to minimize:
RSS + lambda * sum(abs(beta_i))
lambda >= 0, a tuning parameter
Note: The lasso is useful when we suspect that not all predictors matter, since it performs variable selection by shrinking some coefficients exactly to zero.
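This selection behavior is easy to see on a small example: at a large enough lambda, the lasso sets some coefficients exactly to zero. A sketch on the built-in mtcars data (assuming glmnet is installed):

```r
library("glmnet")

x_demo <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y_demo <- mtcars$mpg

lasso_demo <- glmnet(x_demo, y_demo, alpha = 1)  # alpha = 1 -> lasso

# At a moderate lambda, several slope coefficients are exactly zero,
# i.e. those predictors are dropped from the model
co <- coef(lasso_demo, s = 1)
sum(co[-1] == 0)
```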
#Apply lasso regression
m5_lasso <- glmnet(x, y, alpha = 1)
dim(coef(m5_lasso))
## [1] 18 81
Note: glmnet() generated 81 lambda values here.
plot(m5_lasso, xvar = "lambda")
lasso_cv <- cv.glmnet(x, y, alpha = 1)
best_lambda <- lasso_cv$lambda.min
best_lambda
## [1] 8.774628
Conclusion: the best lambda is 8.77.
lasso_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(lasso_model)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -809.48674882
## PrivateYes -515.76727715
## Accept 1.25061635
## Enroll .
## Top10perc 39.83066075
## Top25perc -9.76835696
## F.Undergrad 0.02828286
## P.Undergrad 0.04658106
## Outstate -0.05191041
## Room.Board 0.11921218
## Books -0.06515242
## Personal .
## PhD -8.33462968
## Terminal -2.40392165
## S.F.Ratio 12.91780479
## perc.alumni -4.51089751
## Expend 0.10401740
## Grad.Rate 11.17444304
Note: The variables Enroll and Personal were dropped from the model (coefficients shrunk exactly to zero).
pred_m5_lasso <- predict(m5_lasso, s = best_lambda, newx = x_test)
pred_m5_lasso <- c(pred_m5_lasso)
abs_err_m5_lasso <- abs(pred_m5_lasso - test$Apps)
summary(abs_err_m5_lasso)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.78 176.74 354.43 582.60 691.62 6669.74
sd(abs_err_m5_lasso)
## [1] 757.9307
#Histogram and Boxplot
hist(abs_err_m5_lasso)
boxplot(abs_err_m5_lasso)
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridge" = abs_err_m4_ridge,
"m5_lasso" = abs_err_m5_lasso)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.0986942
## m3_1 596.2298 398.2408 755.0726 0.2370317
## m4_ridge 637.8083 405.3886 814.0771 5.1078962
## m5_lasso 582.5965 354.4272 757.9307 2.7801727
## max.of.Abs.Error
## m1_1 6972.844
## m3_1 6604.283
## m4_ridge 6530.076
## m5_lasso 6669.743
Note: The lasso model performs slightly better than the others so far.
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")
library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
m6_bagging <- randomForest(Apps ~ ., mtry = ncol(train) - 1, data = train)
pred_m6_bagging <- predict(m6_bagging,newdata = test)
abs_err_6_bagging <- abs(pred_m6_bagging - test$Apps)
summary(abs_err_6_bagging)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.419 56.930 168.245 446.009 546.522 6774.290
sd(abs_err_6_bagging)
## [1] 796.3811
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridge" = abs_err_m4_ridge,
"m5_lasso" = abs_err_m5_lasso,
"m6_bagging" = abs_err_6_bagging)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654
## m3_1 596.2298 398.2408 755.0726
## m4_ridge 637.8083 405.3886 814.0771
## m5_lasso 582.5965 354.4272 757.9307
## m6_bagging 446.0091 168.2453 796.3811
## min.of.Abs.Error max.of.Abs.Error
## m1_1 2.0986942 6972.844
## m3_1 0.2370317 6604.283
## m4_ridge 5.1078962 6530.076
## m5_lasso 2.7801727 6669.743
## m6_bagging 0.4188667 6774.290
Note: Bagging clearly outperforms the previous models on mean and median absolute error.
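Bagging is the special case of a random forest in which every predictor is considered at every split (mtry equal to the number of predictors), so only the bootstrap resampling distinguishes the trees. A minimal sketch on the built-in mtcars data (assuming randomForest is installed):

```r
library("randomForest")

set.seed(1)
p <- ncol(mtcars) - 1  # number of predictors
bag_demo <- randomForest(mpg ~ ., data = mtcars, mtry = p, ntree = 500)

# Out-of-bag MSE after 500 trees: a built-in test-error estimate
tail(bag_demo$mse, 1)
```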
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")
#create random forest model
m7_rf <- randomForest(Apps ~ ., data = train, importance = TRUE)
m7_rf
##
## Call:
## randomForest(formula = Apps ~ ., data = train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 939736.9
## % Var explained: 92.82
Note: The forest consists of 500 trees, tries 5 variables at each split, and explains about 93% of the variance.
#model interpretation
importance(m7_rf)
## %IncMSE IncNodePurity
## Private 6.238110 156795273
## Accept 27.821061 2417952619
## Enroll 19.193297 1782445001
## Top10perc 9.676049 209983208
## Top25perc 12.249743 193720006
## F.Undergrad 15.647214 1196008946
## P.Undergrad 5.412150 147035186
## Outstate 13.462204 138037408
## Room.Board 11.312324 84674014
## Books 1.912405 28965936
## Personal 3.359643 40517105
## PhD 7.210371 177730891
## Terminal 7.041672 139776678
## S.F.Ratio 8.124410 52224570
## perc.alumni 2.176190 30647289
## Expend 9.664250 174237461
## Grad.Rate 9.677273 107670091
varImpPlot(m7_rf)
Note (model interpretation): Removing Accept would cause the greatest loss of model accuracy, making it the most important predictor.
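The %IncMSE column is a permutation importance: shuffle one predictor, re-predict, and measure how much the error grows. The idea can be sketched in base R with a simple linear model on simulated data (a hypothetical helper, not part of the analysis above):

```r
set.seed(42)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 3 * df$x1 + 0.1 * df$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = df)

# Permutation importance: shuffle one column, re-predict,
# and report the increase in MSE over the unshuffled baseline
perm_importance <- function(fit, df, var) {
  base_mse <- mean((df$y - predict(fit, df))^2)
  shuffled <- df
  shuffled[[var]] <- sample(shuffled[[var]])
  mean((df$y - predict(fit, shuffled))^2) - base_mse
}

# x1 drives y, so permuting it inflates the MSE far more than permuting x2
perm_importance(fit, df, "x1") > perm_importance(fit, df, "x2")
```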
pred_m7_rf <- predict(m7_rf, newdata = test)
abs_err_m7_rf <- abs(pred_m7_rf - test$Apps)
summary(abs_err_m7_rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.1 101.3 222.2 512.3 606.3 6733.1
sd(abs_err_m7_rf)
## [1] 854.9199
#Comparison of models
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridge" = abs_err_m4_ridge,
"m5_lasso" = abs_err_m5_lasso,
"m6_bag" = abs_err_6_bagging,
"m7_rf" = abs_err_m7_rf)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.0986942
## m3_1 596.2298 398.2408 755.0726 0.2370317
## m4_ridge 637.8083 405.3886 814.0771 5.1078962
## m5_lasso 582.5965 354.4272 757.9307 2.7801727
## m6_bag 446.0091 168.2453 796.3811 0.4188667
## m7_rf 512.2881 222.2484 854.9199 2.1003333
## max.of.Abs.Error
## m1_1 6972.844
## m3_1 6604.283
## m4_ridge 6530.076
## m5_lasso 6669.743
## m6_bag 6774.290
## m7_rf 6733.078
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")
Conclusion: The bagging model is still the best model.
Tuning: We tune when we want to try a range of hyperparameter values and compare how each setting performs.
library("gbm")
## Loaded gbm 2.1.8
library("xgboost")
#Tuning
par_grid <- expand.grid(shrinkage = c(0.1, 0.2, 0.3),
                        interaction_depth = c(2, 3, 4, 5),
                        n_minobsinnode = c(9, 8, 10),
                        bag_fraction = c(0.9, 0.8, 1))
nrow(par_grid)
## [1] 108
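expand.grid() builds one row per combination of the supplied values, which is why we get 3 × 4 × 3 × 3 = 108 candidate settings above. A tiny self-contained illustration:

```r
# Every combination of 3 shrinkage values and 2 depths -> 6 candidate settings
demo_grid <- expand.grid(shrinkage = c(0.1, 0.2, 0.3),
                         interaction_depth = c(2, 3))
nrow(demo_grid)  # 3 x 2 = 6 combinations
```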
We build every combination of these parameter values and fit a gradient boosting model for each one to see which setting works best; this is a grid search.
#Grid search
for (i in 1:nrow(par_grid)) {
  set.seed(1234)
  # train model
  gbm_tune <- gbm(formula = Apps ~ .,
                  distribution = "gaussian",
                  data = train,
                  n.trees = 500,
                  interaction.depth = par_grid$interaction_depth[i],
                  shrinkage = par_grid$shrinkage[i],
                  n.minobsinnode = par_grid$n_minobsinnode[i],
                  bag.fraction = par_grid$bag_fraction[i],
                  train.fraction = 0.7,
                  cv.folds = 0,
                  n.cores = NULL, # will use all cores by default
                  verbose = FALSE)
  # add min validation error and corresponding tree count to the grid
  par_grid$optimal_trees[i] <- which.min(gbm_tune$valid.error)
  par_grid$min_RMSE[i] <- sqrt(min(gbm_tune$valid.error))
}
head(par_grid)
## shrinkage interaction_depth n_minobsinnode bag_fraction optimal_trees
## 1 0.1 2 9 0.9 486
## 2 0.2 2 9 0.9 335
## 3 0.3 2 9 0.9 353
## 4 0.1 3 9 0.9 491
## 5 0.2 3 9 0.9 409
## 6 0.3 3 9 0.9 183
## min_RMSE
## 1 833.9233
## 2 850.1395
## 3 952.0971
## 4 834.3308
## 5 836.6362
## 6 887.2769
par_grid_s <- par_grid[order(par_grid$min_RMSE, decreasing = F),]
head(par_grid_s, 10)
## shrinkage interaction_depth n_minobsinnode bag_fraction optimal_trees
## 17 0.2 3 8 0.9 384
## 92 0.2 4 8 1.0 409
## 87 0.3 2 8 1.0 62
## 98 0.2 2 10 1.0 472
## 97 0.1 2 10 1.0 500
## 7 0.1 4 9 0.9 496
## 55 0.1 4 8 0.8 326
## 46 0.1 5 9 0.8 97
## 86 0.2 2 8 1.0 307
## 56 0.2 4 8 0.8 129
## min_RMSE
## 17 779.5410
## 92 784.7155
## 87 789.5404
## 98 793.9368
## 97 800.6561
## 7 801.3926
## 55 805.7055
## 46 806.1049
## 86 807.6660
## 56 808.1937
Conclusion: We pick the parameter combination with the lowest validation RMSE to build the final model. This grid search was rerun several times to settle on the final values.
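Rather than reading the winning setting off the printed table, it can be pulled out programmatically. A sketch with a toy results frame (the RMSE values here are made up for illustration):

```r
# Toy grid-search results; min_RMSE values are hypothetical
results <- data.frame(shrinkage = c(0.1, 0.2, 0.3),
                      min_RMSE  = c(834, 780, 952))
best <- results[which.min(results$min_RMSE), ]
best$shrinkage  # 0.2: the setting with the lowest validation RMSE
```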
#Final model
m8_gbm <- gbm(formula = Apps ~ .,
              distribution = "gaussian",
              data = train,
              n.trees = 500,
              interaction.depth = 8,
              shrinkage = 0.1,
              n.minobsinnode = 5,
              bag.fraction = 0.9,
              train.fraction = 0.7,
              cv.folds = 0,
              n.cores = NULL) # will use all cores by default
summary(m8_gbm)
## var rel.inf
## Accept Accept 76.47997902
## Enroll Enroll 8.83760384
## Top10perc Top10perc 2.93986520
## Top25perc Top25perc 2.78282312
## Grad.Rate Grad.Rate 2.48602268
## Outstate Outstate 1.61672541
## F.Undergrad F.Undergrad 1.40482916
## Books Books 0.95526587
## Expend Expend 0.82687152
## perc.alumni perc.alumni 0.46877540
## S.F.Ratio S.F.Ratio 0.40716333
## Room.Board Room.Board 0.28931028
## Personal Personal 0.15782873
## PhD PhD 0.13528664
## P.Undergrad P.Undergrad 0.10710970
## Terminal Terminal 0.07633852
## Private Private 0.02820160
Note: Accept and Enroll have by far the most relative influence.
pred_m8_gbm <- predict(m8_gbm, n.trees = 500, newdata = test)
abs_err_m8_gbm <- abs(pred_m8_gbm - test$Apps)
summary(abs_err_m8_gbm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.801 70.684 170.540 439.953 436.964 7315.743
sd(abs_err_m8_gbm)
## [1] 808.7992
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridg" = abs_err_m4_ridge,
"m5_las" = abs_err_m5_lasso,
"m6_bag" = abs_err_6_bagging,
"m7_rf" = abs_err_m7_rf,
"m8_gbm" = abs_err_m8_gbm)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error min.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654 2.0986942
## m3_1 596.2298 398.2408 755.0726 0.2370317
## m4_ridg 637.8083 405.3886 814.0771 5.1078962
## m5_las 582.5965 354.4272 757.9307 2.7801727
## m6_bag 446.0091 168.2453 796.3811 0.4188667
## m7_rf 512.2881 222.2484 854.9199 2.1003333
## m8_gbm 439.9528 170.5401 808.7992 0.8006347
## max.of.Abs.Error
## m1_1 6972.844
## m3_1 6604.283
## m4_ridg 6530.076
## m5_las 6669.743
## m6_bag 6774.290
## m7_rf 6733.078
## m8_gbm 7315.743
Conclusion: The gradient boosting model offers the best performance among the models so far.
#boxplot of absolute errors
boxplot(df_err, main = "Abs.Error Dist. of models")
We tune XGBoost the same way, searching over several hyperparameter settings.
x <- model.matrix(Apps ~ ., data = train)[, -1]
y <- train$Apps
#Tuning
par_grid <- expand.grid(eta = c(0.01, 0.1, 0.3),
                        lambda = c(2, 3, 4),
                        max_depth = c(3, 5, 7),
                        subsample = c(0.8, 0.9, 1),
                        colsample_bytree = c(0.8, 0.9, 1))
nrow(par_grid)
## [1] 243
Again we build every parameter combination and fit an XGBoost model for each one to see which setting works best; another grid search.
#Grid search
for (i in 1:nrow(par_grid)) {
  set.seed(123)
  # train model
  # (note: par_grid$lambda is never passed to xgboost() here, which is why
  # rows differing only in lambda report identical RMSE below)
  xgb_tune <- xgboost(
    data = x,
    label = y,
    eta = par_grid$eta[i],
    max_depth = par_grid$max_depth[i],
    subsample = par_grid$subsample[i],
    colsample_bytree = par_grid$colsample_bytree[i],
    nrounds = 1000,
    objective = "reg:squarederror", # for regression models
    verbose = 0, # silent
    early_stopping_rounds = 10 # stop if no improvement for 10 consecutive trees
  )
  # add min training error and corresponding tree count to the grid
  par_grid$optimal_trees[i] <- which.min(xgb_tune$evaluation_log$train_rmse)
  par_grid$min_RMSE[i] <- min(xgb_tune$evaluation_log$train_rmse)
}
head(par_grid)
## eta lambda max_depth subsample colsample_bytree optimal_trees min_RMSE
## 1 0.01 2 3 0.8 0.8 1000 294.815887
## 2 0.10 2 3 0.8 0.8 1000 12.256836
## 3 0.30 2 3 0.8 0.8 1000 0.051607
## 4 0.01 3 3 0.8 0.8 1000 294.815887
## 5 0.10 3 3 0.8 0.8 1000 12.256835
## 6 0.30 3 3 0.8 0.8 1000 0.051607
par_grid_s <- par_grid[order(par_grid$min_RMSE, decreasing = F),]
head(par_grid_s, 10)
## eta lambda max_depth subsample colsample_bytree optimal_trees min_RMSE
## 21 0.3 2 7 0.8 0.8 297 0.000613
## 24 0.3 3 7 0.8 0.8 297 0.000613
## 27 0.3 4 7 0.8 0.8 297 0.000613
## 174 0.3 2 5 0.8 1.0 473 0.000641
## 177 0.3 3 5 0.8 1.0 473 0.000641
## 180 0.3 4 5 0.8 1.0 473 0.000641
## 210 0.3 2 7 0.9 1.0 261 0.000653
## 213 0.3 3 7 0.9 1.0 261 0.000653
## 216 0.3 4 7 0.9 1.0 261 0.000653
## 48 0.3 2 7 0.9 0.8 247 0.000662
m9_xgb <- xgboost(data = x,
label = y,
eta = 0.3,
max_depth = 7,
lambda = 2,
nrounds = 1000,
colsample_bytree = 1,
subsample = 0.7,
objective = "reg:squarederror",
verbose = 0)
x_test <- model.matrix(Apps ~ ., data = test)[, -1]
pred_m9_xgb <- predict(m9_xgb, x_test)
abs_err_m9_xgb <- abs(pred_m9_xgb - test$Apps)
summary(abs_err_m9_xgb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.914 55.511 164.383 459.185 454.833 5525.344
sd(abs_err_m9_xgb)
## [1] 801.3323
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridge" = abs_err_m4_ridge,
"m5_lasso" = abs_err_m5_lasso,
"m6_bagging" = abs_err_6_bagging,
"m7_rf" = abs_err_m7_rf,
"m8_gbm" = abs_err_m8_gbm,
"m9_xgb" = abs_err_m9_xgb)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654
## m3_1 596.2298 398.2408 755.0726
## m4_ridge 637.8083 405.3886 814.0771
## m5_lasso 582.5965 354.4272 757.9307
## m6_bagging 446.0091 168.2453 796.3811
## m7_rf 512.2881 222.2484 854.9199
## m8_gbm 439.9528 170.5401 808.7992
## m9_xgb 459.1854 164.3834 801.3323
## min.of.Abs.Error max.of.Abs.Error
## m1_1 2.0986942 6972.844
## m3_1 0.2370317 6604.283
## m4_ridge 5.1078962 6530.076
## m5_lasso 2.7801727 6669.743
## m6_bagging 0.4188667 6774.290
## m7_rf 2.1003333 6733.078
## m8_gbm 0.8006347 7315.743
## m9_xgb 0.9140625 5525.344
#boxplot of absolute errors
boxplot(df_err[,1:4], main = "Abs.Error Dist. of models")
boxplot(df_err[,5:8], main = "Abs.Error Dist. of models")
Conclusion: So far, the XGBoost and gradient boosting models are the best, with very similar performance.
Before building a neural network model, we need to scale the data.
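Min-max normalization maps each column to [0, 1] via x' = (x - min) / (max - min); the scale(center = min, scale = max - min) call below does exactly this. A minimal equivalent helper:

```r
# Min-max scaling: smallest value maps to 0, largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

min_max(c(2, 4, 6, 10))  # 0.00 0.25 0.50 1.00
```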
#min_max normalization
college_data_ann <- college_data
college_data_ann$Private_b <- ifelse(college_data_ann$Private == "Yes",1,0)
college_data_ann <- college_data_ann[, -1]
head(college_data_ann, 22)
## Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate
## 1 1660 1232 721 23 52 2885 537 7440
## 2 2186 1924 512 16 29 2683 1227 12280
## 3 1428 1097 336 22 50 1036 99 11250
## 4 417 349 137 60 89 510 63 12960
## 5 193 146 55 16 44 249 869 7560
## 6 587 479 158 38 62 678 41 13500
## 7 353 340 103 17 45 416 230 13290
## 8 1899 1720 489 37 68 1594 32 13868
## 9 1038 839 227 30 63 973 306 15595
## 10 582 498 172 21 44 799 78 10468
## 11 1732 1425 472 37 75 1830 110 16548
## 12 2652 1900 484 44 77 1707 44 17080
## 13 1179 780 290 38 64 1130 638 9690
## 14 1267 1080 385 44 73 1306 28 12572
## 15 494 313 157 23 46 1317 1235 8352
## 16 1420 1093 220 9 22 1018 287 8700
## 17 4302 992 418 83 96 1593 5 19760
## 18 1216 908 423 19 40 1819 281 10100
## 19 1130 704 322 14 23 1586 326 9996
## 20 3540 2001 1016 24 54 4190 1512 5130
## 21 713 661 252 25 44 712 23 15476
## 22 7313 4664 1910 20 63 9940 1035 6806
## Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend
## 1 3300 450 2200 70 78 18.1 12 7041
## 2 6450 750 1500 29 30 12.2 16 10527
## 3 3750 400 1165 53 66 12.9 30 8735
## 4 5450 450 875 92 97 7.7 37 19016
## 5 4120 800 1500 76 72 11.9 2 10922
## 6 3335 500 675 67 73 9.4 11 9727
## 7 5720 500 1500 90 93 11.5 26 8861
## 8 4826 450 850 89 100 13.7 37 11487
## 9 4400 300 500 79 84 11.3 23 11644
## 10 3380 660 1800 40 41 11.5 15 8991
## 11 5406 500 600 82 88 11.3 31 10932
## 12 4440 400 600 73 91 9.9 41 11711
## 13 4785 600 1000 60 84 13.3 21 7940
## 14 4552 400 400 79 87 15.3 32 9305
## 15 3640 650 2449 36 69 11.1 26 8127
## 16 4780 450 1400 78 84 14.7 19 7355
## 17 5300 660 1598 93 98 8.4 63 21424
## 18 3520 550 1100 48 61 12.1 14 7994
## 19 3090 900 1320 62 66 11.5 18 10908
## 20 3592 500 2000 60 62 23.1 5 4010
## 21 3336 400 1100 69 82 11.3 35 42926
## 22 2540 96 2000 83 96 18.3 14 5854
## Grad.Rate Private_b
## 1 60 1
## 2 56 1
## 3 54 1
## 4 59 1
## 5 15 1
## 6 55 1
## 7 63 1
## 8 73 1
## 9 80 1
## 10 52 1
## 11 73 1
## 12 76 1
## 13 74 1
## 14 68 1
## 15 55 1
## 16 69 1
## 17 100 1
## 18 59 1
## 19 46 1
## 20 34 0
## 21 48 1
## 22 70 0
max <- apply(college_data_ann, 2, max)
min <- apply(college_data_ann, 2, min)
scaled_college_data <- as.data.frame(scale(college_data_ann, center = min, scale = max - min))
head(scaled_college_data, 22)
## Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 0.072687934 0.062125107 0.107912537 0.23157895 0.4725275 0.087163535
## 2 0.096901901 0.099185947 0.075035394 0.15789474 0.2197802 0.080751651
## 3 0.062008010 0.054895030 0.047349379 0.22105263 0.4505495 0.028472575
## 4 0.015467477 0.014835047 0.016045304 0.62105263 0.8791209 0.011776282
## 5 0.005155826 0.003963153 0.003146138 0.15789474 0.3846154 0.003491620
## 6 0.023293284 0.021797344 0.019348749 0.38947368 0.5824176 0.017108939
## 7 0.012521291 0.014353042 0.010696870 0.16842105 0.3956044 0.008792534
## 8 0.083690098 0.088260497 0.071417335 0.37894737 0.6483516 0.046184611
## 9 0.044054689 0.041077549 0.030202926 0.30526316 0.5934066 0.026472829
## 10 0.023063113 0.022814910 0.021551046 0.21052632 0.3846154 0.020949721
## 11 0.076002394 0.072461440 0.068743118 0.37894737 0.7252747 0.053675724
## 12 0.118353819 0.097900600 0.070630801 0.45263158 0.7472527 0.049771458
## 13 0.050545505 0.037917738 0.040113261 0.38947368 0.6043956 0.031456323
## 14 0.054596511 0.053984576 0.055057417 0.45263158 0.7032967 0.037042915
## 15 0.019012107 0.012907027 0.019191443 0.23157895 0.4065934 0.037392077
## 16 0.061639737 0.054680805 0.029101778 0.08421053 0.1428571 0.027901219
## 17 0.194310178 0.049271637 0.060248545 0.86315789 0.9560440 0.046152869
## 18 0.052248769 0.044772922 0.061035079 0.18947368 0.3406593 0.053326562
## 19 0.048289831 0.033847472 0.045147082 0.13684211 0.1538462 0.045930675
## 20 0.159232150 0.103309769 0.154318075 0.24210526 0.4945055 0.128586846
## 21 0.029093587 0.031544559 0.034135599 0.25263158 0.3846154 0.018188167
## 22 0.332919026 0.245929734 0.294950448 0.20000000 0.5934066 0.311103352
## P.Undergrad Outstate Room.Board Books Personal PhD Terminal
## 1 0.0245477444 0.2634298 0.2395965 0.15775401 0.29770992 0.6739130 0.71052632
## 2 0.0561483856 0.5134298 0.7361286 0.29144385 0.19083969 0.2282609 0.07894737
## 3 0.0044882070 0.4602273 0.3105296 0.13547237 0.13969466 0.4891304 0.55263158
## 4 0.0028394779 0.5485537 0.5784994 0.15775401 0.09541985 0.9130435 0.96052632
## 5 0.0397526906 0.2696281 0.3688525 0.31372549 0.19083969 0.7391304 0.63157895
## 6 0.0018319212 0.5764463 0.2451135 0.18003565 0.06488550 0.6413043 0.64473684
## 7 0.0104877490 0.5655992 0.6210593 0.18003565 0.19083969 0.8913043 0.90789474
## 8 0.0014197390 0.5954545 0.4801387 0.15775401 0.09160305 0.8804348 1.00000000
## 9 0.0139683994 0.6846591 0.4129887 0.09090909 0.03816794 0.7717391 0.78947368
## 10 0.0035264484 0.4198347 0.2522068 0.25133690 0.23664122 0.3478261 0.22368421
## 11 0.0049919853 0.7338843 0.5715637 0.18003565 0.05343511 0.8043478 0.84210526
## 12 0.0019693153 0.7613636 0.4192938 0.13547237 0.05343511 0.7065217 0.88157895
## 13 0.0291733455 0.3796488 0.4736759 0.22459893 0.11450382 0.5652174 0.78947368
## 14 0.0012365468 0.5285124 0.4369483 0.13547237 0.02290076 0.7717391 0.82894737
## 15 0.0565147699 0.3105372 0.2931904 0.24688057 0.33572519 0.3043478 0.59210526
## 16 0.0130982368 0.3285124 0.4728878 0.15775401 0.17557252 0.7608696 0.78947368
## 17 0.0001831921 0.8997934 0.5548550 0.25133690 0.20580153 0.9239130 0.97368421
## 18 0.0128234486 0.4008264 0.2742749 0.20231729 0.12977099 0.4347826 0.48684211
## 19 0.0148843600 0.3954545 0.2064943 0.35828877 0.16335878 0.5869565 0.55263158
## 20 0.0692008244 0.1441116 0.2856242 0.18003565 0.26717557 0.5652174 0.50000000
## 21 0.0010075567 0.6785124 0.2452711 0.13547237 0.12977099 0.6630435 0.76315789
## 22 0.0473551637 0.2306818 0.1197982 0.00000000 0.26717557 0.8152174 0.94736842
## S.F.Ratio perc.alumni Expend Grad.Rate Private_b
## 1 0.4182306 0.187500 0.07267140 0.55555556 1
## 2 0.2600536 0.250000 0.13838671 0.51111111 1
## 3 0.2788204 0.468750 0.10460535 0.48888889 1
## 4 0.1394102 0.578125 0.29841461 0.54444444 1
## 5 0.2520107 0.031250 0.14583294 0.05555556 1
## 6 0.1849866 0.171875 0.12330575 0.50000000 1
## 7 0.2412869 0.406250 0.10698060 0.58888889 1
## 8 0.3002681 0.578125 0.15648387 0.70000000 1
## 9 0.2359249 0.359375 0.15944351 0.77777778 1
## 10 0.2412869 0.234375 0.10943126 0.46666667 1
## 11 0.2359249 0.484375 0.14602145 0.70000000 1
## 12 0.1983914 0.640625 0.16070654 0.73333333 1
## 13 0.2895442 0.328125 0.08961864 0.71111111 1
## 14 0.3431635 0.500000 0.11535054 0.64444444 1
## 15 0.2305630 0.406250 0.09314382 0.50000000 1
## 16 0.3270777 0.296875 0.07859068 0.65555556 1
## 17 0.1581769 0.984375 0.34380832 1.00000000 1
## 18 0.2573727 0.218750 0.09063661 0.54444444 1
## 19 0.2412869 0.281250 0.14556902 0.40000000 1
## 20 0.5522788 0.078125 0.01553339 0.26666667 0
## 21 0.2359249 0.546875 0.74914698 0.42222222 1
## 22 0.4235925 0.218750 0.05029502 0.66666667 0
hist(scaled_college_data$Apps)
We split the scaled data into train and test sets.
#Divide Dataset into Train and Test
set.seed(12)
train_cases <- sample(1:nrow(scaled_college_data), nrow(scaled_college_data) * 0.7)
train_ann <- scaled_college_data[train_cases, ]
test_ann <- scaled_college_data[-train_cases, ]
dim(train_ann)
## [1] 541 18
dim(test_ann)
## [1] 233 18
summary(train_ann)
## Apps Accept Enroll Top10perc
## Min. :0.0008746 Min. :0.000964 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0319937 1st Qu.:0.028010 1st Qu.:0.03256 1st Qu.:0.1474
## Median :0.0706624 Median :0.058216 Median :0.06339 Median :0.2421
## Mean :0.1389433 Mean :0.107262 Mean :0.12192 Mean :0.2869
## 3rd Qu.:0.1747457 3rd Qu.:0.135551 3rd Qu.:0.14299 3rd Qu.:0.3684
## Max. :0.9257929 Max. :0.804627 Max. :1.00000 Max. :0.9895
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. :0.0000 Min. :0.004539 Min. :0.000000 Min. :0.0124
## 1st Qu.:0.3407 1st Qu.:0.028028 1st Qu.:0.004305 1st Qu.:0.2526
## Median :0.4945 Median :0.051359 Median :0.016121 Median :0.3931
## Mean :0.5188 Mean :0.117147 Mean :0.038893 Mean :0.4175
## 3rd Qu.:0.6703 3rd Qu.:0.142331 3rd Qu.:0.044928 3rd Qu.:0.5480
## Max. :1.0000 Max. :1.000000 Max. :1.000000 Max. :1.0000
## Room.Board Books Personal PhD
## Min. :0.01576 Min. :0.006239 Min. :0.007634 Min. :0.0000
## 1st Qu.:0.28358 1st Qu.:0.165775 1st Qu.:0.099237 1st Qu.:0.5870
## Median :0.37831 Median :0.180036 Median :0.145038 Median :0.7283
## Mean :0.40646 Mean :0.203740 Mean :0.168025 Mean :0.6994
## 3rd Qu.:0.51734 3rd Qu.:0.224599 3rd Qu.:0.221374 3rd Qu.:0.8370
## Max. :1.00000 Max. :1.000000 Max. :0.711908 Max. :1.0000
## Terminal S.F.Ratio perc.alumni Expend
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.6053 1st Qu.:0.2413 1st Qu.:0.2188 1st Qu.:0.06786
## Median :0.7500 Median :0.3003 Median :0.3281 Median :0.09693
## Mean :0.7293 Mean :0.3122 Mean :0.3589 Mean :0.12124
## 3rd Qu.:0.8947 3rd Qu.:0.3780 3rd Qu.:0.4844 3rd Qu.:0.14474
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :0.80148
## Grad.Rate Private_b
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.4889 1st Qu.:0.0000
## Median :0.6111 Median :1.0000
## Mean :0.6212 Mean :0.7264
## 3rd Qu.:0.7556 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
summary(test_ann)
## Apps Accept Enroll Top10perc
## Min. :0.00000 Min. :0.00000 Min. :0.00173 Min. :0.0000
## 1st Qu.:0.03241 1st Qu.:0.02999 1st Qu.:0.03256 1st Qu.:0.1474
## Median :0.06274 Median :0.05431 Median :0.06009 Median :0.2211
## Mean :0.11543 Mean :0.09174 Mean :0.10422 Mean :0.2633
## 3rd Qu.:0.13773 3rd Qu.:0.10551 3rd Qu.:0.11656 3rd Qu.:0.3158
## Max. :1.00000 Max. :1.00000 Max. :0.91851 Max. :1.0000
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. :0.04396 Min. :0.00000 Min. :0.0000916 Min. :0.0000
## 1st Qu.:0.37363 1st Qu.:0.02552 1st Qu.:0.0043050 1st Qu.:0.2696
## Median :0.49451 Median :0.04815 Median :0.0162125 Median :0.4069
## Mean :0.50403 Mean :0.10177 Mean :0.0394113 Mean :0.4227
## 3rd Qu.:0.62637 3rd Qu.:0.10665 3rd Qu.:0.0398443 3rd Qu.:0.5429
## Max. :1.00000 Max. :0.94839 Max. :0.5019922 Max. :0.9174
## Room.Board Books Personal PhD
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.02174
## 1st Qu.:0.2900 1st Qu.:0.1667 1st Qu.:0.08397 1st Qu.:0.58696
## Median :0.3925 Median :0.1800 Median :0.14504 Median :0.73913
## Mean :0.4062 Mean :0.1976 Mean :0.16356 Mean :0.71091
## 3rd Qu.:0.5147 3rd Qu.:0.2246 3rd Qu.:0.21756 3rd Qu.:0.85870
## Max. :0.8149 Max. :0.5811 Max. :1.00000 Max. :1.00000
## Terminal S.F.Ratio perc.alumni Expend
## Min. :0.1184 Min. :0.02145 Min. :0.0000 Min. :0.005542
## 1st Qu.:0.6316 1st Qu.:0.23861 1st Qu.:0.1719 1st Qu.:0.067073
## Median :0.8026 Median :0.29491 Median :0.3125 Median :0.100911
## Mean :0.7420 Mean :0.30624 Mean :0.3480 Mean :0.124271
## 3rd Qu.:0.8816 3rd Qu.:0.35925 3rd Qu.:0.4844 3rd Qu.:0.143137
## Max. :1.0000 Max. :0.67828 Max. :0.9375 Max. :1.000000
## Grad.Rate Private_b
## Min. :0.05556 Min. :0.0000
## 1st Qu.:0.46667 1st Qu.:0.0000
## Median :0.61111 Median :1.0000
## Mean :0.60291 Mean :0.7339
## 3rd Qu.:0.74444 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
Note: the train and test distributions are close to each other, which is good: the network is evaluated on data that resembles what it was trained on.
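This closeness claim can also be checked numerically. A minimal sketch (on two small synthetic splits standing in for train_ann and test_ann, since the real splits are defined above) compares per-column means:

```r
# Sketch: compare per-column means of two splits; small gaps suggest
# the splits have similar distributions. train_demo / test_demo are
# synthetic stand-ins for train_ann / test_ann.
set.seed(1)
train_demo <- data.frame(a = runif(100), b = runif(100))
test_demo  <- data.frame(a = runif(50),  b = runif(50))
mean_gap <- abs(colMeans(train_demo) - colMeans(test_demo))
round(mean_gap, 3)
```

Applying the same idea to train_ann and test_ann column by column would quantify how well the split preserved the distributions.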
library("neuralnet")
## Warning: package 'neuralnet' was built under R version 4.0.5
model_nn <- neuralnet(Apps ~ ., data = train_ann, hidden = c(3, 5), act.fct = "logistic", linear.output = TRUE)
plot(model_nn, col.hidden = "dark green", col.hidden.synapse = "dark green", fill = "light blue")
test_pred_nn <- compute(model_nn, test_ann[, 2:18])
y_pred_nn <- (test_pred_nn$net.result[,1] * (max(college_data_ann$Apps) - min(college_data_ann$Apps))) + min(college_data_ann$Apps)
y_test_nn <- (test_ann$Apps * (max(college_data_ann$Apps) - min(college_data_ann$Apps))) + min(college_data_ann$Apps)
plot(y_test_nn, y_pred_nn, col ='blue', pch = 16, ylab = "predicted values", xlab = "real values")
abline(0, 1, col = "red")
plot(y_test_nn, col = "red", type = "l")
lines(y_pred_nn, col = "blue")
#Calculate Root Mean Square Error (RMSE)
rmse_nn <- (sum((y_test_nn - y_pred_nn) ^ 2) / length(y_test_nn)) ^ 0.5
rmse_nn
## [1] 771.2675
abs_err_ann <- abs(y_pred_nn - y_test_nn)
summary(abs_err_ann)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.656 86.213 207.209 416.223 453.307 5334.065
sd(abs_err_ann)
## [1] 650.7141
df_err <- data.frame("m1_1" = abs_err_m1_1,
"m3_1" = abs_err_m3_1,
"m4_ridge" = abs_err_m4_ridge,
"m5_lasso" = abs_err_m5_lasso,
"m6_bagging" = abs_err_6_bagging,
"m7_rf" = abs_err_m7_rf,
"m8_gbm" = abs_err_m8_gbm,
"m9_xgb" = abs_err_m9_xgb,
"m10_ann" = abs_err_ann)
models_comp <- data.frame("mean of Abs Error" = apply(df_err, 2, mean ),
"median of Abs Error" = apply(df_err, 2, median),
"SD of Abs Error" = apply(df_err, 2, sd ),
"min of Abs Error" = apply(df_err, 2, min ),
"max of Abs Error" = apply(df_err, 2, max) )
models_comp
## mean.of.Abs.Error median.of.Abs.Error SD.of.Abs.Error
## m1_1 596.9731 390.5073 773.7654
## m3_1 596.2298 398.2408 755.0726
## m4_ridge 637.8083 405.3886 814.0771
## m5_lasso 582.5965 354.4272 757.9307
## m6_bagging 446.0091 168.2453 796.3811
## m7_rf 512.2881 222.2484 854.9199
## m8_gbm 439.9528 170.5401 808.7992
## m9_xgb 459.1854 164.3834 801.3323
## m10_ann 416.2235 207.2094 650.7141
## min.of.Abs.Error max.of.Abs.Error
## m1_1 2.0986942 6972.844
## m3_1 0.2370317 6604.283
## m4_ridge 5.1078962 6530.076
## m5_lasso 2.7801727 6669.743
## m6_bagging 0.4188667 6774.290
## m7_rf 2.1003333 6733.078
## m8_gbm 0.8006347 7315.743
## m9_xgb 0.9140625 5525.344
## m10_ann 2.6555347 5334.065
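The comparison table above can also be built more compactly by summarising each error column in a single pass. A minimal sketch (df_demo is a two-column toy stand-in for df_err):

```r
# Sketch: summarise absolute errors per model in one sapply call.
# df_demo stands in for df_err; each column is one model's errors.
df_demo <- data.frame(m_a = c(1, 2, 3), m_b = c(2, 4, 6))
comp <- t(sapply(df_demo, function(e)
  c(mean = mean(e), median = median(e), sd = sd(e),
    min = min(e), max = max(e))))
comp
```

Replacing df_demo with df_err reproduces the same five statistics as the apply-based models_comp table.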
Conclusion: The XGBoost, Gradient Boosting, and Artificial Neural Network models perform similarly, but the Artificial Neural Network does best overall, with the lowest mean and standard deviation of absolute error (the boosted tree models have lower medians).
boxplot(df_err[,1:5], main = "Abs.Error Dist. of models")
boxplot(df_err[,6:9], main = "Abs.Error Dist. of models")
Questions of Modeling
Answer: In this section we built 10 forecasting models, reported their results, and compared them. The best model was the Artificial Neural Network.
In the final evaluation, we compare the results of the created prediction models to select the best model for deployment.
par(mfrow = c(3, 3))
for (i in 1:9) {
hist(df_err[, i], xlab = "", main = colnames(df_err)[i], probability = TRUE)
lines(density(df_err[,i]), col = "red")
}
par(mfrow = c(1, 5))
for (i in 1:5) {
boxplot(df_err[, i], xlab = "", main = colnames(df_err)[i] )
}
for (i in 6:9) {
boxplot(df_err[, i], xlab = "", main = colnames(df_err)[i] )
}
Conclusion: Based on the error distributions, the Artificial Neural Network appears to be the best model for deployment.
Questions of Evaluation
Answer: The models were tested and the best one was the Artificial Neural Network. Since this was my first data analysis project and I was not connected to a real project environment, I could not give a good answer to question 2.
The Artificial Neural Network was our best model, so we use it for forecasting.
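When the model is deployed, compute() returns predictions on the scaled [0, 1] range, so they must be mapped back to the original Apps scale, just as in the test evaluation above. A minimal helper (a sketch; the 0..20000 range in the example call is illustrative, not taken from the data):

```r
# Invert the min-max scaling applied to Apps: map a value in [0, 1]
# back to the original scale. In deployment this would wrap the
# output of compute(model_nn, new_scaled_row)$net.result.
unscale <- function(x, orig_min, orig_max) {
  x * (orig_max - orig_min) + orig_min
}
unscale(0.5, 0, 20000)  # midpoint of an illustrative 0..20000 range
```

In practice orig_min and orig_max must be the minimum and maximum of Apps from the data the scaling was fitted on (college_data_ann), exactly as in the y_pred_nn calculation above.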
Questions of Deployment
Answer: The algorithm may not work as well in a real-world setting, so it is better to evaluate more models before deployment.
Questions of Conclusion
Answer: In this project I became familiar with the different parts of a data science project and was able to evaluate and test myself against what I learned in this course.
I created various regression prediction models and worked with them to report the best one.
Finally, I think this course has been one of the best courses of my studies.
Thank you for all the hard work you put in for us this year.